Abstract: While Kolmogorov complexity is the accepted
absolute measure of information content of an individual fi- 
nite object, a similarly absolute notion is needed for the re- 
lation between an individual data sample and an individual 
model summarizing the information in the data, for exam- 
ple, a finite set (or probability distribution) where the data 
sample typically came from. The statistical theory based 
on such relations between individual objects can be called 
algorithmic statistics, in contrast to classical statistical the- 
ory that deals with relations between probabilistic ensem- 
bles. We develop the algorithmic theory of statistic, suffi- 
cient statistic, and minimal sufficient statistic. This theory 
is based on two-part codes consisting of the code for the 
statistic (the model summarizing the regularity, the mean- 
ingful information, in the data) and the model-to-data code. 
In contrast to the situation in probabilistic statistical the- 
ory, the algorithmic relation of (minimal) sufficiency is an 
absolute relation between the individual model and the in- 
dividual data sample. We distinguish implicit and explicit 
descriptions of the models. We give characterizations of al- 
gorithmic (Kolmogorov) minimal sufficient statistic for all 
data samples for both description modes — in the explicit 
mode under some constraints. We also strengthen and elab- 
orate earlier results on the "Kolmogorov structure function" 
and "absolutely non-stochastic objects" — those rare objects 
for which the simplest models that summarize their rele- 
vant information (minimal sufficient statistics) are at least 
as complex as the objects themselves. We demonstrate a 
close relation between the probabilistic notions and the al- 
gorithmic ones: (i) in both cases there is an "information 
non-increase" law; (ii) it is shown that a function is a prob- 
abilistic sufficient statistic iff it is with high probability (in 
an appropriate sense) an algorithmic sufficient statistic. 

Keywords — Algorithmic information theory; description 
format, explicit, implicit; foundations of statistics; Kol- 
mogorov complexity; minimal sufficient statistic, algorith- 
mic; mutual information, algorithmic; nonstochastic ob- 
jects; sufficient statistic, algorithmic; two-part codes. 



I. Introduction 

STATISTICAL theory ideally considers the following 
problem: Given a data sample and a family of mod- 
els (hypotheses), select the model that produced the data. 
But a priori it is possible that the data is atypical for the
model that actually produced it, or that the true model is 
not present in the considered model class. Therefore we 
have to relax our requirements. If selection of a "true" 
model cannot be guaranteed by any method, then as next 
best choice "modeling the data" as well as possible irre- 
spective of truth and falsehood of the resulting model may 
be more appropriate. Thus, we change "true" to "as well 
as possible." The latter we take to mean that the model 
expresses all significant regularity present in the data. The 
general setting is as follows: We carry out a probabilistic 
experiment of which the outcomes are governed by an un- 
known probability distribution P. Suppose we obtain as 
outcome the data sample x. Given x, we want to recover 
the distribution P. For certain reasons we can choose a dis- 
tribution from a set of acceptable distributions only (which 
may or may not contain P). Intuitively, our selection cri- 
teria are that (i) x should be a "typical" outcome of the 
distribution selected, and (ii) the selected distribution has 
a "simple" description. We need to make the meaning of 
"typical" and "simple" rigorous and balance the require- 
ments (i) and (ii). In probabilistic statistics one analyzes 
the average-case performance of the selection process. For 
traditional problems, dealing with frequencies over small 
sample spaces, this approach is appropriate. But for cur- 
rent novel applications, average relations are often irrele- 
vant, since the part of the support of the probability density 
function that will ever be observed has about zero measure. 
This is the case in, for example, complex video and sound 
analysis. There arises the problem that for individual cases 
the selection performance may be bad although the perfor- 
mance is good on average. We embark on a systematic 
study of model selection where the performance is related 
to the individual data sample and the individual model 
selected. It turns out to be more straightforward to inves- 
tigate models that are finite sets first, and then generalize 
the results to models that are probability distributions. To 
simplify matters, and because all discrete data can be bi- 
nary coded, we consider only data samples that are finite 
binary strings. 

This paper is one of a triad of papers dealing with the 
best individual model for individual data: The present pa- 
per supplies the basic theoretical underpinning by way of 
two-part codes, [?] derives ideal versions of applied methods
(MDL) inspired by the theory, and [?] treats experimental
applications thereof.

Probabilistic Statistics: In ordinary statistical theory
one proceeds as follows, see for example [?]: Suppose
two discrete random variables X, Y have a joint probability
mass function $p(x, y)$ and marginal probability mass functions
$p_1(x) = \sum_y p(x, y)$ and $p_2(y) = \sum_x p(x, y)$. Then
the (probabilistic) mutual information $I(X; Y)$ between the
joint distribution and the product distribution $p_1(x) p_2(y)$
is defined by:

$$I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p_1(x) p_2(y)}, \qquad (I.1)$$



where "log" denotes the binary logarithm. Consider a
probabilistic ensemble of models, say a family of probability
mass functions $\{f_\theta\}$ indexed by $\theta$, together with a distribution
$p_1$ over $\theta$. This way we have a random variable $\Theta$ with
outcomes in $\{f_\theta\}$ and a random variable D with outcomes
in the union of domains of $f_\theta$, and $p(\theta, d) = p_1(\theta) f_\theta(d)$.
Every function T(D) of a data sample D — like the sample 
mean or the sample variance — is called a statistic of D. A 
statistic T(D) is called sufficient if the probabilistic mutual 
information 



$$I(\Theta; D) = I(\Theta; T(D)) \qquad (I.2)$$



for all distributions of $\theta$. Hence, the mutual information
between parameter and data sample random variables is
invariant under taking a sufficient statistic, and vice versa.
That is to say, a statistic T(D) is called sufficient for $\Theta$ if
it contains all the information in D about $\Theta$. For example,
consider n tosses of a coin with unknown bias $\theta$ with outcome
$D = d_1 d_2 \ldots d_n$ where $d_i \in \{0, 1\}$ ($1 \le i \le n$). Given
n, the number of outcomes "1" is a sufficient statistic for $\Theta$:
the statistic $T(D) = s = \sum_{i=1}^{n} d_i$. Given T, all sequences
with s "1"s are equally likely independent of the parameter $\theta$:
if d is an outcome of n coin tosses with $T(d) = s$, then
$\Pr(d \mid T(D) = s) = \binom{n}{s}^{-1}$, and if $T(d) \ne s$, then $\Pr(d \mid T(D) = s) = 0$.
This can be shown to imply (I.2) and therefore T is a sufficient
statistic for $\Theta$. According to Fisher, "The statistic
chosen should summarise the whole of the relevant infor- 
mation supplied by the sample. This may be called the Cri- 
terion of Sufficiency ... In the case of the normal curve of 
distribution it is evident that the second moment is a suffi- 
cient statistic for estimating the standard deviation." Note 
that one cannot improve on sufficiency: for every (possibly 
randomized) function T we have 



$$I(\Theta; D) \ge I(\Theta; T(D)), \qquad (I.3)$$



that is, mutual information cannot be increased by pro- 
cessing the data sample in any way. 
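The coin-toss example and the non-increase law can be checked numerically on a small case. The following is a minimal sketch, not part of the original development: the two-point prior over the bias, the choice n = 5, and all function names are assumptions made only for illustration. It computes the mutual information between the parameter and the full sample, the number of ones, and the first toss alone.

```python
from itertools import product
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def coin_joints(prior, n):
    """Joint pmfs of (theta, D) and (theta, T(D)), where T(D) counts the ones."""
    joint_d, joint_t = {}, {}
    for theta, weight in prior.items():
        for d in product((0, 1), repeat=n):
            s = sum(d)
            p = weight * theta**s * (1 - theta)**(n - s)
            joint_d[(theta, d)] = p
            joint_t[(theta, s)] = joint_t.get((theta, s), 0.0) + p
    return joint_d, joint_t

# Hypothetical prior: the bias is 0.2 or 0.7 with equal probability.
prior = {0.2: 0.5, 0.7: 0.5}
joint_d, joint_t = coin_joints(prior, n=5)

i_data = mutual_information(joint_d)   # I(Theta; D)
i_stat = mutual_information(joint_t)   # I(Theta; T(D)), T = number of ones

# A statistic that is not sufficient: keep only the first toss.
joint_first = {}
for (theta, d), p in joint_d.items():
    joint_first[(theta, d[0])] = joint_first.get((theta, d[0]), 0.0) + p
i_first = mutual_information(joint_first)
```

In this sketch `i_data` and `i_stat` agree up to floating-point rounding, as equality (I.2) requires for a sufficient statistic, while `i_first` is strictly smaller, in line with inequality (I.3).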

A sufficient statistic may contain information that is not 
relevant: for a normal distribution the sample mean is a 
sufficient statistic, but the pair of functions which give the 
mean of the even-numbered samples and the odd-numbered 
samples respectively, is also a sufficient statistic. A statis- 
tic T(D) is a minimal sufficient statistic with respect to 
an indexed model family $\{f_\theta\}$, if it is a function of all
other sufficient statistics: it contains no irrelevant infor- 
mation and maximally compresses the information about 
the model ensemble. As it happens, for the family of nor- 
mal distributions the sample mean is a minimal sufficient 
statistic, but the sufficient statistic consisting of the mean 
of the even samples in combination with the mean of the 
odd samples is not minimal. All these notions and laws are 
probabilistic: they hold in an average sense. 



Kolmogorov Complexity: We write string to mean a 
finite binary sequence. Other finite objects can be encoded 
into strings in natural ways. The Kolmogorov complexity, 
or algorithmic entropy, K(x) of a string x is the length of a 
shortest binary program to compute x on a universal com- 
puter (such as a universal Turing machine). Intuitively, 
$K(x)$ represents the minimal amount of information required
to generate x by any effective process, [?]. The
conditional Kolmogorov complexity $K(x \mid y)$ of x relative
to y is defined similarly as the length of a shortest program
to compute x if y is furnished as an auxiliary input
to the computation. This conditional definition requires a
warning since different authors use the same notation but
mean different things. In [?] the author writes "$K(x \mid y)$"
to actually mean "$K(x \mid y, K(y))$," notationally hiding
the intended supplementary auxiliary information "$K(y)$."
This abuse of notation has the additional handicap that
no obvious notation is left to express "$K(x \mid y)$" meaning
that just "y" is given in the conditional. As it happens,
"$y, K(y)$" represents more information than just "y". For
example, $K(K(y) \mid y)$ can be almost as large as $\log K(y)$
by a result in [7]: For $l(y) = n$ it has an upper bound of
$\log n$ for all y, and for some y's it has a lower bound of
$\log n - \log \log n$. In fact, this result quantifies the undecidability
of the halting problem for Turing machines: for
example, if $K(K(y) \mid y) = O(1)$ for all y, then the halting
problem can be shown to be decidable. This is known to
be false. It is customary, [14], [?], [?], to write explicitly
"$K(x \mid y)$" and "$K(x \mid y, K(y))$". Even though the difference
between these two quantities is not very large, these
small differences do matter in the sequel. In fact, not only 
the precise information itself in the conditional, but also 



the way it is represented, is crucial; see Subsection III-A.

The functions $K(\cdot)$ and $K(\cdot \mid \cdot)$, though defined in terms
of a particular machine model, are machine-independent 
up to an additive constant and acquire an asymptotically 
universal and absolute character through Church's thesis, 
from the ability of universal machines to simulate one an- 
other and execute any effective process. The Kolmogorov 
complexity of a string can be viewed as an absolute and 
objective quantification of the amount of information in it. 
This leads to a theory of absolute information contents of 
individual objects in contrast to classical information the- 
ory which deals with average information to communicate 
objects produced by a random source. Since the former 
theory is much more precise, it is surprising that analogs 
of theorems in classical information theory hold for Kol- 
mogorov complexity, be it in somewhat weaker form. Here 
our aim is to provide a similarly absolute notion for indi- 
vidual "sufficient statistic" and related notions borrowed 
from probabilistic statistics. 

Two-part codes: The prefix-code of the shortest effec- 
tive descriptions gives an expected code word length close 
to the entropy and also compresses the regular objects until 
all regularity is squeezed out. All shortest effective descrip- 
tions are completely random themselves, without any regu- 
larity whatsoever. The idea of a two-part code for a body of 
data d is natural from the perspective of Kolmogorov complexity.
If d does not contain any regularity at all, then it
consists of purely random data and the model is precisely 
that. Assume that the body of data d contains regularity. 
With help of a description of the regularity (a model) we 
can describe the data compactly. Assuming that the reg- 
ularity can be represented in an effective manner (that is, 
by a Turing machine), we encode the data as a program for
that machine. Squeezing all effective regularity out of the 
data, we end up with a Turing machine representing the 
meaningful regular information in the data together with a 
program for that Turing machine representing the remain- 
ing meaningless randomness of the data. However, in gen- 
eral there are many ways to make the division into mean- 
ingful information and remaining random information. In 
a painting the represented image, the brush strokes, or even 
finer detail can be the relevant information, depending on 
what we are interested in. What we require is a rigorous 
mathematical condition to force a sensible division of the 
information at hand into a meaningful part and a mean- 
ingless part. 

Algorithmic Statistics: The two-part code approach 
leads to a more general algorithmic approach to statistics. 
The algorithmic statistician's task is to select a model (de- 
scribed possibly by a probability distribution) for which the 
data is typical. In a two-part description, we describe such 
a model and then identify the data within the set of the 
typical outcomes. The best models make the two-part de- 
scription as concise as the best one-part description of the 
data. A description of such a model is an algorithmic suffi- 
cient statistic since it summarizes all relevant properties of 
the data. Among the algorithmic sufficient statistics, the 
simplest one (an algorithmic minimal sufficient statistic) 
is best in accordance with Ockham's Razor since it sum- 
marizes the relevant properties of the data as concisely as 
possible. In probabilistic data or data subject to noise this 
involves separating regularity (structure) in the data from 
random effects. 

In a restricted setting where the models are finite sets a 
way to proceed was suggested by Kolmogorov, attribution
in [17], [?], [?]. Given data d, the goal is to identify the
"most likely" finite set S of which d is a "typical" element.
Finding a set of which the data is typical is reminiscent
of selecting the appropriate magnification of a microscope
to bring the studied specimen optimally in focus. For this
purpose we consider sets S such that $d \in S$ and we represent
S by the shortest program $S^*$ that computes the
characteristic function of S. The shortest program $S^*$ that
computes a finite set S containing d, such that the two-part
description consisting of $S^*$ and $\log |S|$ is as short as the
shortest single program that computes d without input, is
called an algorithmic sufficient statistic (it is also called the
Kolmogorov sufficient statistic). This definition is
non-vacuous since there does exist a two-part code (based
on the model $S_d = \{d\}$) that is as concise as the shortest
single code. The description of d given $S^*$ cannot be significantly
shorter than $\log |S|$. By the theory of Martin-Löf
randomness [16] this means that d is a "typical" element
of S. In general there can be many algorithmic sufficient
statistics for data d; a shortest among them is called an al- 
gorithmic minimal sufficient statistic. Note that there can 
be possibly more than one algorithmic minimal sufficient 
statistic; they are defined by, but not generally computable 
from, the data. 
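The shape of a two-part description can be illustrated with an explicitly computable stand-in. Kolmogorov complexity itself is not computable, so the sketch below does not compute $K$; it only compares codelengths under one fixed, hypothetical model class (the set of all n-bit strings with exactly k ones) and a simple prefix code for natural numbers. The function names and the example strings are assumptions for illustration.

```python
from math import comb, log2

def self_delimiting_bits(m):
    """Bits used by a simple prefix code for a natural number m
    (unary length marker plus binary digits): 2 * bit_length + 1."""
    return 2 * m.bit_length() + 1

def two_part_length(d):
    """Codelength of the bit string d under the model
    S = {strings of length n with exactly k ones}.
    Returns (bits to describe S, bits to index d inside S)."""
    n, k = len(d), d.count("1")
    model_bits = self_delimiting_bits(n) + self_delimiting_bits(k)
    index_bits = log2(comb(n, k))          # log |S|
    return model_bits, index_bits

def literal_length(d):
    """Codelength under the trivial model S = {0,1}^n:
    describe n, then spend n bits on the index."""
    return self_delimiting_bits(len(d)), float(len(d))

sparse = "0" * 95 + "10000"    # 100 bits containing a single one
balanced = "01" * 50           # 100 bits, 50 ones

sparse_total = sum(two_part_length(sparse))
sparse_literal = sum(literal_length(sparse))
balanced_total = sum(two_part_length(balanced))
balanced_literal = sum(literal_length(balanced))
```

For the sparse string the count-of-ones model gives a far shorter two-part description than the literal one. For the alternating string it does not: that string is regular, but its regularity is not captured by the number of ones, so this particular model class is the wrong one for it.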

In probabilistic statistics the notion of sufficient statistic 
(I.2) is an average notion invariant under all probability
distributions over the family of indexed models. If a statis- 
tic is not thus invariant, it is not sufficient. In contrast, in 
the algorithmic case we investigate the relation between the 
data and an individual model and therefore a probability 
distribution over the models is irrelevant. It is technically 
convenient to initially consider the simple model class of 
finite sets to obtain our results. It then turns out that it is 
relatively easy to generalize everything to the model class 
of computable probability distributions. That class is very 
large indeed: perhaps it contains every distribution that 
has ever been considered in statistics and probability the- 
ory, as long as the parameters are computable numbers — 
for example rational numbers. Thus the results are of great 
generality; indeed, they are so general that further devel- 
opment of the theory must be aimed at restrictions on this 
model class; see the discussion about applicability in Section VII.
The theory concerning the statistics of individual



data samples and models one may call algorithmic statis- 
tics. 

Background and Related Work: At a Tallinn confer- 
ence in 1973, A.N. Kolmogorov formulated the approach to 
an individual data to model relation, based on a two-part 
code separating the structure of a string from meaningless 
random features, rigorously in terms of Kolmogorov complexity
(attribution by [?]). Cover [?], [?] interpreted
this approach as a (sufficient) statistic. The "statistic" of
the data is expressed as a finite set of which the data is
a "typical" member. Following Shen [?] (see also [?],
[?], [?]), this can be generalized to computable probability
mass functions for which the data is "typical." Related
aspects of "randomness deficiency" (formally defined later
in (IV.1)) were formulated in [12], and studied in [?],
[?]. Algorithmic mutual information, and the associated
non-increase law, were studied in [14], [15]. Despite its
evident epistemological prominence in the theory of hypothesis
selection and prediction, only selected aspects of
the algorithmic sufficient statistic have been studied before,
for example as related to the "Kolmogorov structure function"
[17], [?], and "absolutely non-stochastic objects" [?],
[?], [18], [22], notions also defined or suggested by Kolmogorov
at the mentioned meeting. This work primarily
studies quantification of the "non-sufficiency" of an algo- 
rithmic statistic, when the latter is restricted in complexity, 
rather than necessary and sufficient conditions for the ex- 
istence of an algorithmic sufficient statistic itself. These 
references obtain results for plain Kolmogorov complex- 
ity (sometimes length-conditional) up to a logarithmic er- 
ror term. Especially for regular data that have low Kol- 
mogorov complexity with respect to their length, this loga- 
rithmic error term may dominate the remaining terms and 
eliminate all significance. Since it is precisely the regular
data that one wants to assess the meaning of, a more pre- 
cise analysis as we provide is required. Here we use prefix 
complexity to unravel the nature of a sufficient statistic. 
The excellent papers of Shen [17], [?] contain the major
previous results related to this work (although [?] is independent).
While previous work and the present paper
consider an algorithmic statistic that is either a finite set
or a computable probability mass function, the most general
algorithmic statistic is a recursive function. In [?] the
present work is generalized accordingly; see the summary
in Section VII.



For the relation with inductive reasoning according to the
minimum description length principle see [20]. The entire
approach is based on Kolmogorov complexity (also
known as algorithmic information theory). Historically,
the idea of assigning to each object a probability consisting
of the summed negative exponentials of the lengths of
all programs computing the object, was first proposed by
Solomonoff [?]. Then, the shorter programs contribute
more probability than the longer ones. His aim, ultimately
successful in terms of theory (see [?]) and as inspiration
for developing applied versions [?], was to develop a general
prediction method. Kolmogorov introduced the
complexity proper. The prefix version of Kolmogorov complexity
used in this paper was introduced in [14] and also
treated later in [?]. For a textbook on Kolmogorov complexity,
its mathematical theory, and its application to induction,
see [?]. We give a definition (attributed to Kolmogorov)
and results from [17] that are useful later:

Definition I.1: Let $\alpha$ and $\beta$ be natural numbers. A finite
binary string x is called $(\alpha, \beta)$-stochastic if there exists a
finite set $S \subseteq \{0, 1\}^*$ such that

$$x \in S, \quad K(S) \le \alpha, \quad K(x) \ge \log |S| - \beta, \qquad (I.4)$$

where $|S|$ denotes the cardinality of S, and $K(\cdot)$ the (prefix-)
Kolmogorov complexity. As usual, "log" denotes the binary
logarithm.

The first inequality with small $\alpha$ means that S is "simple";
the second inequality with small $\beta$ means that x
is "in general position" in S. Indeed, if x had any special
property p that was shared by only a small subset Q
of S, then this property could be used to single out and
enumerate those elements and subsequently indicate x by
its index in the enumeration. Altogether, this would show
$K(x) \le K(p) + \log |Q|$, which, for simple p and small Q
would be much lower than $\log |S|$. A similar notion for
computable probability distributions is as follows: Let $\alpha$
and $\beta$ be natural numbers. A finite binary string x is
called $(\alpha, \beta)$-quasistochastic if there exists a computable
probability distribution P such that

$$P(x) > 0, \quad K(P) \le \alpha, \quad K(x) \ge -\log P(x) - \beta. \qquad (I.5)$$



Proposition I.2: There exist constants c and C, such
that for every natural number n and every finite binary
string x of length n:

(a) if x is $(\alpha, \beta)$-stochastic, then x is $(\alpha + c, \beta)$-quasistochastic; and

(b) if x is $(\alpha, \beta)$-quasistochastic and the length of x is
less than n, then x is $(\alpha + c \log n, \beta + C)$-stochastic.

Proposition I.3: (a) There exists a constant C such that,
for every natural number n and every $\alpha$ and $\beta$ with
$\alpha \ge \log n + C$ and $\alpha + \beta \ge n + 4 \log n + C$, all strings of length
less than n are $(\alpha, \beta)$-stochastic.

(b) There exists a constant C such that, for every natural
number n and every $\alpha$ and $\beta$ with $2\alpha + \beta < n - 6 \log n - C$,
there exist strings x of length less than n that are not
$(\alpha, \beta)$-stochastic.

Note that if we take $\alpha = \beta$ then, for some boundary in between
$\frac{1}{3} n$ and $\frac{1}{2} n$, the last non-$(\alpha, \beta)$-stochastic elements
disappear if the complexity constraints are sufficiently relaxed
by having $\alpha, \beta$ exceed this boundary.
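The location of this boundary follows by substituting $\alpha = \beta$ into the two parts of Proposition I.3. The short derivation below is a sketch; the names $C_a$ and $C_b$ are introduced here only to keep apart the (generally different) constants of parts (a) and (b), and n is assumed large enough that the second condition of part (a) implies the first.

```latex
% Part (a) with \alpha = \beta: every string of length < n is (\alpha,\alpha)-stochastic once
\begin{align*}
2\alpha \ge n + 4\log n + C_a
  \quad&\Longleftrightarrow\quad
  \alpha \ge \tfrac{1}{2}\,n + 2\log n + \tfrac{1}{2}\,C_a .
\intertext{Part (b) with $\alpha = \beta$: some string of length $< n$ is not $(\alpha,\alpha)$-stochastic as long as}
3\alpha < n - 6\log n - C_b
  \quad&\Longleftrightarrow\quad
  \alpha < \tfrac{1}{3}\,n - 2\log n - \tfrac{1}{3}\,C_b .
\end{align*}
```

So, up to logarithmic terms, non-stochastic strings certainly exist below $\frac{1}{3} n$ and certainly do not exist above $\frac{1}{2} n$; the propositions leave the exact threshold in between undetermined.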

Outline of this Work: First, we obtain a new Kol- 
mogorov complexity "triangle" inequality that is useful in 
the later parts of the paper. We define algorithmic mu- 
tual information between two individual objects (in con- 
trast to the probabilistic notion of mutual information that 
deals with random variables). We show that for every 
computable distribution associated with the random vari- 
ables, the expectation of the algorithmic mutual informa- 
tion equals the probabilistic mutual information up to an 
additive constant that depends on the complexity of the 
distribution. It is known that in the probabilistic setting 
the mutual information (an average notion) cannot be in- 
creased by algorithmic processing. We give a new proof 
that this also holds in the individual setting. 

We define notions of "typicality" and "optimality" of 
sets in relation to the given data x. Denote the shortest 
program for a finite set S by S* (if there is more than one 
shortest program S* is the first one in the standard effective 
enumeration). "Typicality" is a reciprocal relation: A set
S is "typical" with respect to x if x is an element of S
that is "typical" in the sense of having small randomness
deficiency $\delta^*_S(x) = \log |S| - K(x \mid S^*)$ (see definition (IV.1)
and discussion). That is, x has about maximal Kolmogorov
complexity in the set, because it can always be identified
by its position in an enumeration of S in $\log |S|$ bits. Every
statistic. 

A set S is "optimal" if the best two-part description consisting
of a description of S and a straightforward description
of x as an element of S by an index of size $\log |S|$ is
as concise as the shortest one-part description of x. This 
implies that optimal sets are typical sets. Descriptions of 
such optimal sets are algorithmic sufficient statistics, and a 
shortest description among them is an algorithmic minimal 
sufficient statistic. The mode of description plays a major 
role in this. We distinguish between "explicit" descriptions 
and "implicit" descriptions — that are introduced in this pa- 
per as a proper restriction on the recursive enumeration 
based description mode. We establish range constraints of 
cardinality and complexity imposed by implicit (and hence 
explicit) descriptions for typical and optimal sets, and ex- 
hibit a concrete algorithmic minimal sufficient statistic for 
implicit description mode. It turns out that only the complexity
of the data sample x is relevant for this implicit
algorithmic minimal sufficient statistic. Subsequently we
exhibit explicit algorithmic sufficient statistics, and an explicit
minimal algorithmic (near-)sufficient statistic. For
explicit descriptions it turns out that certain other aspects 
of x (its enumeration rank) apart from its complexity are 
a major determinant for the cardinality and complexity of 
that statistic. It is convenient at this point to introduce 
some notation: 

Notation I.4: From now on, we will denote by $\stackrel{+}{<}$ an inequality
to within an additive constant, and by $\stackrel{+}{=}$ the situation
when both $\stackrel{+}{<}$ and $\stackrel{+}{>}$ hold. We will also use $\stackrel{*}{<}$ to
denote an inequality to within a multiplicative constant
factor, and $\stackrel{*}{=}$ to denote the situation when both $\stackrel{*}{<}$ and $\stackrel{*}{>}$
hold.

Let us contrast our approach with the one in [?]. The
comparable case there, by (I.4), is that x is $(\alpha, \beta)$-stochastic
with $\beta = 0$ and $\alpha$ minimal. Then, $K(x) \ge \log |S|$ for a
set S of Kolmogorov complexity $\alpha$. But, if S is optimal
for x, then, as we formally define it later (III.4),
$K(x) \stackrel{+}{=} K(S) + \log |S|$. That is, (I.4) holds with $\beta \stackrel{+}{=} -K(S)$. In
contrast, for $\beta = 0$ we must have $K(S) \stackrel{+}{=} 0$ for typicality.
In short, optimality of S with respect to x corresponds to
(I.4) by dropping the second item and replacing the third
item by $K(x) \stackrel{+}{=} \log |S| + K(S)$. "Minimality" of the algorithmic
sufficient statistic $S^*$ (the shortest program for S)
corresponds to choosing S with minimal $K(S)$ in this equation.
This is equivalent to (I.4) with inequalities replaced
by equalities and $K(S) = \alpha = -\beta$.

We consider the functions related to (a, /3)-stochasticity, 
and improve Shen's result on maximally non-stochastic ob- 
jects. In particular, we show that for every n there are ob- 
jects x of length n with complexity K(x \ n) about n such 
that every explicit algorithmic sufficient statistic for x has 
complexity about n ({x} is such a statistic). This is the 
best possible. In Section V we generalize the entire treatment
to probability density distributions. In Section VI
we connect the algorithmic and probabilistic approaches: 
While previous authors have used the name "Kolmogorov 
sufficient statistic" because the model appears to summa- 
rize the relevant information in the data in analogy of what 
the classic sufficient statistic does in a probabilistic sense, 
a formal justification has been lacking. We give the for- 
mal relation between the algorithmic approach to sufficient 
statistic and the probabilistic approach: A function is a 
probabilistic sufficient statistic iff it is with high probability
an algorithmic $\theta$-sufficient statistic, where an algorithmic
sufficient statistic is $\theta$-sufficient if it satisfies also the
sufficiency criterion conditionalized on $\theta$.

II. Kolmogorov Complexity 

We give some definitions to establish notation. For introduction,
details, and proofs, see [10]. We write string
to mean a finite binary string. Other finite objects can be
encoded into strings in natural ways. The set of strings is
denoted by $\{0, 1\}^*$. The length of a string x is denoted by
$l(x)$, distinguishing it from the cardinality $|S|$ of a finite set S.

Let $x, y, z \in \mathcal{N}$, where $\mathcal{N}$ denotes the natural numbers.
Identify $\mathcal{N}$ and $\{0, 1\}^*$ according to the correspondence

$$(0, \epsilon), (1, 0), (2, 1), (3, 00), (4, 01), \ldots$$

Here $\epsilon$ denotes the empty word with no letters. The length
$l(x)$ of x is the number of bits in the binary string x. For
example, $l(010) = 3$ and $l(\epsilon) = 0$.

The emphasis is on binary sequences only for conve- 
nience; observations in any alphabet can be so encoded 
in a way that is 'theory neutral'. 

A binary string x is a proper prefix of a binary string y
if we can write $y = xz$ for $z \ne \epsilon$. A set $\{x, y, \ldots\} \subseteq \{0, 1\}^*$
is prefix-free if for any pair of distinct elements in the set
neither is a proper prefix of the other. A prefix-free set is
also called a prefix code. Each binary string $x = x_1 x_2 \ldots x_n$
has a special type of prefix code, called a self-delimiting
code,

$$\bar{x} = 1^n 0 x_1 x_2 \ldots x_n.$$

This code is self-delimiting because we can determine where
the code word $\bar{x}$ ends by reading it from left to right without
backing up. Using this code we define the standard self-delimiting
code for x to be $x' = \overline{l(x)}\, x$. It is easy to check
that $l(\bar{x}) = 2n + 1$ and $l(x') = n + 2 \log n + 1$.

Let $\langle \cdot, \cdot \rangle$ be a standard one-one mapping from $\mathcal{N} \times \mathcal{N}$ to
$\mathcal{N}$, for technical reasons chosen such that
$l(\langle x, y \rangle) = l(y) + l(x) + 2 l(l(x)) + 1$, for example
$\langle x, y \rangle = x' y = 1^{l(l(x))}\, 0\, l(x)\, x\, y$.
This can be iterated to $\langle \langle \cdot, \cdot \rangle, \cdot \rangle$.
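The encodings of this section are simple enough to write out directly. The following is a minimal sketch with strings of the characters '0' and '1' standing in for binary strings; the function names (`nat_to_str`, `bar`, `prime`, `pair`, and their inverses) are hypothetical and chosen only for this illustration. It implements the correspondence between natural numbers and strings, the code $\bar{x}$, the standard self-delimiting code $x'$, and the pairing $\langle x, y \rangle = x' y$.

```python
def nat_to_str(m):
    """Correspondence 0, 1, 2, 3, 4, ... -> '', '0', '1', '00', '01', ...
    (the binary expansion of m + 1 with its leading 1 removed)."""
    return bin(m + 1)[3:]

def str_to_nat(s):
    """Inverse of nat_to_str."""
    return int("1" + s, 2) - 1

def bar(x):
    """Self-delimiting code: n ones, a zero, then the n bits of x."""
    return "1" * len(x) + "0" + x

def unbar(stream):
    """Read one bar-coded word from the front of stream; return (word, rest)."""
    n = stream.index("0")                  # count of leading ones
    return stream[n + 1: 2 * n + 1], stream[2 * n + 1:]

def prime(x):
    """Standard self-delimiting code: bar(l(x)) followed by x,
    with the length l(x) written as a string via nat_to_str."""
    return bar(nat_to_str(len(x))) + x

def unprime(stream):
    """Read one prime-coded word from the front of stream; return (word, rest)."""
    length_str, rest = unbar(stream)
    n = str_to_nat(length_str)
    return rest[:n], rest[n:]

def pair(x, y):
    """Pairing <x, y> = x' y."""
    return prime(x) + y

def unpair(code):
    """Recover (x, y) from <x, y>."""
    return unprime(code)
```

For example, `bar('010')` is `'1110010'` (length 2n + 1 = 7), `prime('010')` is `'11000010'`, and `unpair(pair('010', '1101'))` returns the original pair. In this sketch the length of `prime(x)` is exactly $n + 2\, l(l(x)) + 1$, where $l(l(x))$ is the length of the string that the correspondence assigns to n, which is about $\log n$.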

The prefix Kolmogorov complexity, or algorithmic entropy,
$K(x)$ of a string x is the length of a shortest binary
program to compute x on a universal computer (such
as a universal Turing machine). For technical reasons we
require that the universal machine has the property that
no halting program is a proper prefix of another halting
program. Intuitively, $K(x)$ represents the minimal amount
of information required to generate x by any effective process.
We denote the shortest program for x by $x^*$; then
$K(x) = l(x^*)$. (Actually, $x^*$ is the first shortest program for
x in an appropriate standard enumeration of all programs
for x such as the halting order.) The conditional Kolmogorov
complexity $K(x \mid y)$ of x relative to y is defined
similarly as the length of a shortest program to compute x
if y is furnished as an auxiliary input to the computation.
We often use $K(x \mid y^*)$, or, equivalently, $K(x \mid y, K(y))$
(trivially $y^*$ contains the same information as the pair $y, K(y)$).
Note that "y" in the conditional is just the information
about y and apart from this does not contain information
about $y^*$ or $K(y)$. For this work the difference is crucial;
see the comment in Section I.

A. Additivity of Complexity 

Recall that by definition $K(x, y) = K(\langle x, y \rangle)$. Trivially,
the symmetry property holds: $K(x, y) \stackrel{+}{=} K(y, x)$. Later
we will use many times the "Additivity of Complexity"
property

$$K(x, y) \stackrel{+}{=} K(x) + K(y \mid x^*) \stackrel{+}{=} K(y) + K(x \mid y^*). \qquad (II.1)$$

This result due to [?] can be found as Theorem 3.9.1 in [10]
and has a difficult proof. It is perhaps instructive to point
out that the version with just x and y in the conditionals
doesn't hold with $\stackrel{+}{=}$, but holds up to additive logarithmic
terms that cannot be eliminated. The conditional version
needs to be treated carefully. It is

$$K(x, y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid x, K(x \mid z), z). \qquad (II.2)$$

Note that a naive version

$$K(x, y \mid z) \stackrel{+}{=} K(x \mid z) + K(y \mid x^*, z)$$

is incorrect: taking $z = x$, $y = K(x)$, the left-hand side
equals $K(x^* \mid x)$, and the right-hand side equals
$K(x \mid x) + K(K(x) \mid x^*, x) \stackrel{+}{=} 0$. First, we derive a (to our knowledge)
new "directed triangle inequality" that is needed later.
Theorem II. 1: For all x, y, z, 

K(x | y*) < K(x, z\y*)< K(z | y*) + K(x \ z*). 



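Proof: Using (II.1), an evident inequality introducing
an auxiliary object $z$, and twice (II.1) again:

$$
\begin{aligned}
K(x, z \mid y^*) &\stackrel{+}{=} K(x, y, z) - K(y) \\
&\stackrel{+}{<} K(z) + K(x \mid z^*) + K(y \mid z^*) - K(y) \\
&\stackrel{+}{=} K(y, z) - K(y) + K(x \mid z^*) \\
&\stackrel{+}{=} K(x \mid z^*) + K(z \mid y^*).
\end{aligned}
$$

■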

This theorem has bizarre consequences. These conse- 
quences are not simple unexpected artifacts of our defini- 
tions, but, to the contrary, they show the power and the 
genuine contribution to our understanding represented by 
the deep and important mathematical relation (11.1). 

Denote $k = K(y)$ and substitute $k = z$ and $K(k) = x$ to
find the following counterintuitive corollary: To determine
the complexity of the complexity of an object $y$ it suffices
to give both $y$ and the complexity of $y$. This is counterintuitive
since in general we cannot compute the complexity of
an object from the object itself; if we could, this would also
solve the so-called "halting problem". This noncomputability
can be quantified in terms of $K(K(y) \mid y)$ which
can rise to almost K(K(y)) for some y — see the related dis- 
cussion on notation for conditional complexity in Section |[ 
But in the seemingly similar, but subtly different, setting 
below it is possible. 

Corollary II.2: As above, let $k$ denote $K(y)$. Then,

$$K(K(k) \mid y, k) \stackrel{+}{=} K(K(k) \mid y^*) \stackrel{+}{<} K(K(k) \mid k^*) + K(k \mid y, k) \stackrel{+}{=} 0.$$

We can iterate this idea. For example, the
next step is that given $y$ and $K(y)$ we can determine
$K(K(K(y)))$ in $O(1)$ bits, that is, $K(K(K(k)) \mid y, k) \stackrel{+}{=} 0$.

A direct construction works according to the following
idea (where we ignore some important details): From $k^*$
one can compute $\langle k, K(k) \rangle$ since $k^*$ is by definition the
shortest program for $k$ and also by definition $l(k^*) = K(k)$.
Conversely, from $k, K(k)$ one can compute $k^*$ by running
all programs of length at most $K(k)$ in dovetailed
fashion until the first program of length $K(k)$
halts with output $k$; this is $k^*$. The shortest program
that computes the pair $\langle y, k \rangle$ has length $\stackrel{+}{=} k$: We have
$K(y, k) \stackrel{+}{=} k$ (since the shortest program $y^*$ for $y$ carries
both the information about $y$ and about $k = l(y^*)$).
By (II.1) therefore $K(k) + K(y \mid k, K(k)) \stackrel{+}{=} k$. In view
of the information equivalence of $\langle k, K(k) \rangle$ and $k^*$, therefore
$K(k) + K(y \mid k^*) \stackrel{+}{=} k$. Let $r$ be a program of length
$l(r) = K(y \mid k^*)$ that computes $y$ from $k^*$. Then, since
$l(k^*) = K(k)$, there is a shortest program $y^* = q k^* r$ for
$y$ where $q$ is a fixed $O(1)$-bit self-delimiting program that
unpacks and uses $k^*$ and $r$ to compute $y$. We are now in
the position to show $K(K(k) \mid y, k) \stackrel{+}{=} 0$. There is a fixed
$O(1)$-bit program that includes knowledge of $q$ and that
enumerates two lists in parallel, each in dovetailed fashion:
Using $k$ it enumerates a list of all programs that compute
$k$, including $k^*$. Given $y$ and $k$ it enumerates another list
of all programs of length $k \stackrel{+}{=} l(y^*)$ that compute $y$. One
of these programs is $y^* = q k^* r$, which starts with $q k^*$. Since
$q$ is known, this self-delimiting program $k^*$, and hence its
length $K(k)$, can be found by matching every element in
the $k$-list with the prefixes of every element in the $y$-list in
enumeration order.
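The dovetailed search used in this construction can be illustrated with a toy sketch. It is only an illustration under simplifying assumptions: "programs" are modelled as hypothetical Python generators (one `yield` per computation step) rather than binary programs for a universal prefix machine, and the names `dovetail`, `make_program` and `loop_forever` are invented here.

```python
from typing import Callable, Generator, Iterable, List, Optional, Tuple

# A toy "program": a generator that yields once per step and returns its output.
Program = Callable[[], Generator[None, None, str]]


def make_program(steps: int, output: str) -> Program:
    """Hypothetical program that halts with `output` after `steps` steps."""
    def run() -> Generator[None, None, str]:
        for _ in range(steps):
            yield
        return output
    return run


def loop_forever() -> Generator[None, None, str]:
    """Hypothetical program that never halts."""
    while True:
        yield


def dovetail(programs: Iterable[Program], target: str,
             max_stages: int) -> Optional[int]:
    """Return the index of the first program found to halt with `target`.

    At each stage the next program (if any) is started, and every program
    started so far is advanced by one step, so a non-halting program
    cannot block the search.
    """
    pending = iter(programs)
    running: List[Tuple[int, Generator[None, None, str]]] = []
    started = 0
    for _ in range(max_stages):
        nxt = next(pending, None)
        if nxt is not None:
            running.append((started, nxt()))
            started += 1
        for entry in list(running):
            index, gen = entry
            try:
                next(gen)
            except StopIteration as halt:
                running.remove(entry)
                if halt.value == target:
                    return index
    return None  # nothing halted with `target` within the stage budget


toy_programs = [loop_forever, make_program(3, "wrong"),
                make_program(5, "k"), make_program(1, "k")]
first_hit = dovetail(toy_programs, "k", max_stages=100)
```

The stage budget `max_stages` exists only so that the toy terminates; the construction in the text has no such bound, because a program with the required output is known to exist.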



B. Information Non-Increase 

If we want to find an appropriate model fitting the data, 
then we are concerned with the information in the data 
about such models. Intuitively one feels that the infor- 
mation in the data about the appropriate model cannot be 
increased by any algorithmic or probabilistic process. Here, 
we rigorously show that this is the case in the algorithmic 
statistics setting: the information in one object about an- 
other cannot be increased by any deterministic algorithmic 
method by more than a constant. With added randomization
this holds with overwhelming probability. We use the
triangle inequality of Theorem II.1 to recall, and to give
possibly new proofs of, this information non-increase; for
more elaborate but hard-to-follow versions see |t4| , Jll| .

We need the following technical concepts. Let us call
a nonnegative real function $f(x)$ defined on strings a
semimeasure if $\sum_x f(x) \le 1$, and a measure (a probability
distribution) if the sum is 1. A function $f(x)$ is
called lower semicomputable if there is a rational valued
computable function $g(n, x)$ such that $g(n+1, x) \ge g(n, x)$
and $\lim_{n \to \infty} g(n, x) = f(x)$. For an upper semicomputable
function $f$ we require that $-f$ is lower semicomputable. It
is computable when it is both lower and upper semicomputable.
(A lower semicomputable measure is also computable.)
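As a small illustration of these definitions, the following sketch checks them on a toy example. The function `f` (assigning $2^{-(2 l(x) + 1)}$ to each binary string $x$) and the approximation `g` are assumptions chosen for this sketch only; they do not appear in the paper.

```python
from fractions import Fraction
from itertools import product
from typing import Iterator


def f(x: str) -> Fraction:
    """Toy measure on binary strings: f(x) = 2^-(2*len(x)+1)."""
    return Fraction(1, 2 ** (2 * len(x) + 1))


def strings_up_to(max_len: int) -> Iterator[str]:
    """All binary strings of length 0..max_len."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)


def partial_mass(max_len: int) -> Fraction:
    """Sum of f over all strings of length at most max_len."""
    return sum((f(x) for x in strings_up_to(max_len)), Fraction(0))


def g(n: int, x: str) -> Fraction:
    """Rational approximation of f(x): nondecreasing in n, converging from below."""
    return f(x) * (1 - Fraction(1, 2 ** n))
```

The partial sums equal $1 - 2^{-(m+1)}$ for maximum length $m$, so the total mass tends to 1 and `f` is a measure; `g` witnesses lower semicomputability in the sense defined above.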

To define the algorithmic mutual information between
two individual objects $x$ and $y$ with no probabilities involved,
it is instructive to first recall the probabilistic notion (I.1).
Rewriting (I.1) as

$$\sum_x \sum_y p(x, y) \left[ -\log p(x) - \log p(y) + \log p(x, y) \right],$$

and noting that $-\log p(s)$ is very close to the length of
the prefix-free Shannon-Fano code for $s$, we are led to the
following definition.² The information in $y$ about $x$ is
defined as

$$I(y : x) = K(x) - K(x \mid y^*) \stackrel{+}{=} K(x) + K(y) - K(x, y), \tag{II.3}$$

where the second equality is a consequence of (II.1) and
states that this information is symmetrical, $I(x : y) \stackrel{+}{=} I(y : x)$,
and therefore we can talk about mutual information.³
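The symmetric form $K(x) + K(y) - K(x, y)$ can be made tangible with a rough experiment. Since $K$ is not computable, the sketch below substitutes the output length of a general-purpose compressor, which is only a crude upper-bound proxy; encoding the pair $\langle x, y \rangle$ as a plain concatenation is a further assumption, and the names `c` and `info_proxy` are invented for this illustration.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bits: a crude, computable stand-in for K."""
    return 8 * len(zlib.compress(data, 9))


def info_proxy(x: bytes, y: bytes) -> int:
    """Proxy for I(x : y) using the symmetric form K(x) + K(y) - K(x, y)."""
    return c(x) + c(y) - c(x + y)


rng = random.Random(0)
x = bytes(rng.getrandbits(8) for _ in range(2000))
y = bytes(rng.getrandbits(8) for _ in range(2000))  # drawn independently of x

shared = info_proxy(x, x)     # y is a copy of x: a large value is expected
unrelated = info_proxy(x, y)  # independent strings: a small value is expected
```

The proxy inherits none of the exactness of the theory (the additive $O(1)$ terms become compressor-dependent overheads), so it should be read as intuition only.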
Remark II.3: The conditional mutual information is

$$I(x : y \mid z) = K(x \mid z) - K(x \mid y, K(y \mid z), z) \stackrel{+}{=} K(x \mid z) + K(y \mid z) - K(x, y \mid z).$$

◇



It is important that the expectation of the algorithmic mu- 
tual information I(x : y) is close to the probabilistic mu- 
tual information I(X; Y) — if this were not the case then 
the algorithmic notion would not be a sharpening of the 
probabilistic notion to individual objects, but something 
else. 

Lemma II.4: Given a computable joint probability mass
distribution $p(x, y)$ over $(x, y)$ we have

$$I(X; Y) - K(p) \stackrel{+}{<} \sum_x \sum_y p(x, y)\, I(x : y) \stackrel{+}{<} I(X; Y) + 2K(p), \tag{II.4}$$

where $K(p)$ is the length of the shortest prefix-free program
that computes $p(x, y)$ from input $(x, y)$.

Remark II.5: Above we required $p(\cdot, \cdot)$ to be computable.
Actually, we only require that $p$ be a lower semicomputable
function, which is a weaker requirement than recursivity.
However, together with the condition that $p(\cdot, \cdot)$ is a probability
distribution, $\sum_{x, y} p(x, y) = 1$, this means that $p(\cdot, \cdot)$
is computable, [10], Section 8.1. ◇
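Proof: Rewrite the expectation

$$\sum_x \sum_y p(x, y)\, I(x : y) \stackrel{+}{=} \sum_x \sum_y p(x, y) \left[ K(x) + K(y) - K(x, y) \right].$$

Define $\sum_y p(x, y) = p_1(x)$ and $\sum_x p(x, y) = p_2(y)$ to obtain

$$\sum_x \sum_y p(x, y)\, I(x : y) \stackrel{+}{=} \sum_x p_1(x) K(x) + \sum_y p_2(y) K(y) - \sum_x \sum_y p(x, y) K(x, y).$$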

² The Shannon-Fano code has nearly optimal expected code length
equal to the entropy with respect to the distribution of the source
|a]. However, the prefix-free code with code word length $K(s)$ has
both about expected optimal code word length and individual optimal
effective code word length, [10].

³ The notation of the algorithmic (individual) notion $I(x : y)$ distinguishes
it from the probabilistic (average) notion $I(X; Y)$. We deviate
slightly from [10] where $I(y : x)$ is defined as $K(x) - K(x \mid y)$.



Given the program that computes $p$, we can approximate
$p_1(x)$ by $q_1(x, y_0) = \sum_{y \le y_0} p(x, y)$, and similarly for $p_2$.
That is, the distributions $p_i$ ($i = 1, 2$) are lower semicomputable,
and by Remark II.5, therefore, they are computable.
It is known that for every computable probability
mass function $q$ we have
$H(q) \le \sum_x q(x) K(x) \stackrel{+}{<} H(q) + K(q)$, [10], Section 8.1.

Hence, $H(p_i) \le \sum_x p_i(x) K(x) \stackrel{+}{<} H(p_i) + K(p_i)$ ($i = 1, 2$),
and $H(p) \le \sum_{x, y} p(x, y) K(x, y) \stackrel{+}{<} H(p) + K(p)$.
On the other hand, the probabilistic mutual information
(I.1) is expressed in the entropies by $I(X; Y) = H(p_1) + H(p_2) - H(p)$.
By construction of the $q_i$'s above, we have
$K(p_1), K(p_2) \stackrel{+}{<} K(p)$. Since the complexities are positive,
substitution establishes the lemma. ■

Can we get rid of the $K(p)$ error term? The answer is
affirmative; by putting $p(\cdot)$ in the conditional we even get
rid of the computability requirement.

Lemma II.6: Given a joint probability mass distribution
$p(x, y)$ over $(x, y)$ (not necessarily computable) we have

$$I(X; Y) \stackrel{+}{=} \sum_x \sum_y p(x, y)\, I(x : y \mid p),$$

where the auxiliary $p$ means that we can directly access
the values $p(x, y)$ on the auxiliary conditional information
tape of the reference universal prefix machine.

Proof: The lemma follows from the definition of conditional
algorithmic mutual information, Remark II.3, if we
show that $\sum_x p(x) K(x \mid p) \stackrel{+}{=} H(p)$, where the $O(1)$ term
implicit in the $\stackrel{+}{=}$ sign is independent of $p$.

Equip the reference universal prefix machine with an
$O(1)$ length program to compute a Shannon-Fano code
from the auxiliary table of probabilities. Then, given an input
$r$, it can determine whether $r$ is the Shannon-Fano code
word for some $x$. Such a code word has length $\stackrel{+}{=} -\log p(x)$.
If this is the case, then the machine outputs $x$, otherwise
it halts without output. Therefore, $K(x \mid p) \stackrel{+}{<} -\log p(x)$.
This shows the upper bound on the expected prefix complexity.
The lower bound follows as usual from the Noiseless
Coding Theorem. ■
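The code-length bound used in this proof can be checked numerically on small distributions. The sketch below assigns each outcome the length $\lceil -\log_2 p(x) \rceil$ (the usual Shannon-Fano length, taken here as an assumption about the exact rounding), verifies the Kraft inequality that guarantees a prefix-free code with these lengths exists, and compares the expected length with the entropy. The distributions and function names are illustrative only.

```python
import math
from typing import Dict


def shannon_fano_lengths(p: Dict[str, float]) -> Dict[str, int]:
    """Code word lengths ceil(-log2 p(x)) for outcomes with p(x) > 0."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items() if px > 0}


def kraft_sum(lengths: Dict[str, int]) -> float:
    """Sum of 2^-length; a prefix-free code with these lengths exists iff <= 1."""
    return sum(2.0 ** -length for length in lengths.values())


def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def expected_length(p: Dict[str, float], lengths: Dict[str, int]) -> float:
    """Expected code word length under p."""
    return sum(p[x] * length for x, length in lengths.items())


dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
skewed = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

dyadic_lengths = shannon_fano_lengths(dyadic)
skewed_lengths = shannon_fano_lengths(skewed)
```

For the dyadic distribution the expected length equals the entropy exactly; in general it lies between $H(p)$ and $H(p) + 1$, which is the finite analogue of the $\stackrel{+}{=}$ relation in the proof.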

We prove a strong version of the information non- 
increase law under deterministic processing (later we need 
the attached corollary): 

Theorem II.7: Given $x$ and $z$, let $q$ be a program computing
$z$ from $x^*$. Then

$$I(z : y) \stackrel{+}{<} I(x : y) + K(q).$$

Proof: By the triangle inequality,

$$K(y \mid x^*) \stackrel{+}{<} K(y \mid z^*) + K(z \mid x^*) \stackrel{+}{<} K(y \mid z^*) + K(q). \tag{II.5}$$

Thus,

$$
\begin{aligned}
I(x : y) &= K(y) - K(y \mid x^*) \\
&\stackrel{+}{>} K(y) - K(y \mid z^*) - K(q) \\
&= I(z : y) - K(q).
\end{aligned}
$$

■

This also implies the slightly weaker but intuitively more
appealing statement that the mutual information between
strings $x$ and $y$ cannot be increased by processing $x$ and $y$
separately by deterministic computations.

Corollary II.8: Let $f, g$ be recursive functions. Then

$$I(f(x) : g(y)) \stackrel{+}{<} I(x : y) + K(f) + K(g). \tag{II.6}$$

Proof: It suffices to prove the case $g(y) = y$ and
apply it twice. The proof is by replacing the program $q$
that computes a particular string $z$ from a particular $x^*$ in
(II.5). There, $q$ possibly depends on $x^*$ and $z$. Replace it
by a program $q_f$ that first computes $x$ from $x^*$, followed by
computing a recursive function $f$; that is, $q_f$ is independent
of $x$. Since we only require an $O(1)$-length program to
compute $x$ from $x^*$ we can choose $l(q_f) \stackrel{+}{=} K(f)$.
By the triangle inequality,

$$K(y \mid x^*) \stackrel{+}{<} K(y \mid f(x)^*) + K(f(x) \mid x^*) \stackrel{+}{<} K(y \mid f(x)^*) + K(f).$$

Thus,

$$
\begin{aligned}
I(x : y) &= K(y) - K(y \mid x^*) \\
&\stackrel{+}{>} K(y) - K(y \mid f(x)^*) - K(f) \\
&= I(f(x) : y) - K(f).
\end{aligned}
$$

■

It turns out that furthermore, randomized computation 
can increase information only with negligible probability. 
Let us define the universal probability $\mathbf{m}(x) = 2^{-K(x)}$. This
function is known to be maximal within a multiplicative
constant among lower semicomputable semimeasures. So,
in particular, for each computable measure $\nu(x)$ we have
$\nu(x) \stackrel{*}{<} \mathbf{m}(x)$, where the constant factor in $\stackrel{*}{<}$ depends on $\nu$.
This property also holds when we have an extra parameter,
like $y^*$, in the condition.

Suppose that $z$ is obtained from $x$ by some randomized
computation. The probability $p(z \mid x)$ of obtaining $z$ from
$x$ is a semicomputable distribution over the $z$'s. Therefore
it is upperbounded by $\mathbf{m}(z \mid x) \stackrel{*}{<} \mathbf{m}(z \mid x^*) = 2^{-K(z \mid x^*)}$.
The information increase $I(z : y) - I(x : y)$ satisfies the
theorem below.

Theorem II.9: For all $x, y, z$ we have

$$\mathbf{m}(z \mid x^*)\, 2^{I(z : y) - I(x : y)} \stackrel{*}{<} \mathbf{m}(z \mid x^*, y, K(y \mid x^*)).$$

Remark II.10: For example, the probability of an increase
of mutual information by the amount $d$ is $\stackrel{*}{<} 2^{-d}$.
The theorem implies $\sum_z \mathbf{m}(z \mid x^*)\, 2^{I(z : y) - I(x : y)} \stackrel{*}{<} 1$; that is, the
$\mathbf{m}(\cdot \mid x^*)$-expectation of the exponential of the increase is
bounded by a constant. ◇

Proof: We have

$$I(z : y) - I(x : y) = K(y) - K(y \mid z^*) - (K(y) - K(y \mid x^*)) = K(y \mid x^*) - K(y \mid z^*).$$

The negative logarithm of the left-hand side in the theorem
is therefore

$$K(z \mid x^*) + K(y \mid z^*) - K(y \mid x^*).$$

Using Theorem II.1 and the conditional additivity (II.2),
this is

$$\stackrel{+}{>} K(y, z \mid x^*) - K(y \mid x^*) \stackrel{+}{=} K(z \mid x^*, y, K(y \mid x^*)).$$

■

III. Finite Set Models 

For convenience, we initially consider the model class 
consisting of the family of finite sets of finite binary strings, 
that is, the set of subsets of {0, 1}*. 

A. Finite Set Representations 

Although all finite sets are recursive there are different 
ways to represent or specify the set. We only consider ways 
that have in common a method of recursively enumerat- 
ing the elements of the finite set one by one, and differ in 
knowledge of its size. For example, we can specify a set of 
natural numbers by giving an explicit table or a decision 
procedure for membership and a bound on the largest ele- 
ment, or by giving a recursive enumeration of the elements 
together with the number of elements, or by giving a recur- 
sive enumeration of the elements together with a bound on 
the running time. We call a representation of a finite set $S$
explicit if the size $|S|$ of the finite set can be computed from
it. A representation of $S$ is implicit if the logsize $\lfloor \log |S| \rfloor$
can be computed from it.

Example III.1: In Section III-D, we will introduce the
set $S^k$ of strings whose elements have complexity $\le k$. It
will be shown that this set can be represented implicitly by
a program of size $K(k)$, but can be represented explicitly
only by a program of size $k$. ◇

Such representations are useful in two-stage encodings
where one stage of the code consists of an index in $S$ of
length $\stackrel{+}{=} \log |S|$. In the implicit case we know, within an
additive constant, how long an index of an element in the
set is.

We can extend the notion of Kolmogorov complexity
from finite binary strings to finite sets: The (prefix-) complexity
$K_X(S)$ of a finite set $S$ is defined by

$$K_X(S) = \min_i \{ K(i) : \text{Turing machine } T_i \text{ computes } S \text{ in representation format } X \},$$

where $X$ is for example "implicit" or "explicit". In general
$S^*$ denotes the first shortest self-delimiting binary program
($l(S^*) = K(S)$) in enumeration order from which $S$ can be
computed. These definitions depend, as explained above,
crucially on the representation format $X$: the way $S$ is supposed
to be represented as output of the computation can
make a world of difference for $S^*$ and $K(S)$. Since the representation
format will be clear from the context, and to
simplify notation, we drop the subscript $X$. To complete
our discussion: the worst case of representation format $X$,
a recursively enumerable representation where nothing is
known about the size of the finite set, would lead to indices
of unknown length. We do not consider this case.
We may use the notation

$$S_{\text{impl}}, \quad S_{\text{expl}}$$

for some implicit and some explicit representation of $S$.
When a result applies to both implicit and explicit representations,
or when it is clear from the context which
representation is meant, we will omit the subscript.

B. Optimal Model and Sufficient Statistic 

In the following we will distinguish between "models" 
that are finite sets, and the "shortest programs" to com- 
pute those models that are finite strings. Such a shortest 
program is in the proper sense a statistic of the data sam- 
ple as defined before. In a way this distinction between 
"model" and "statistic" is artificial, but for now we prefer 
clarity and unambiguousness in the discussion. 

Consider a string $x$ of length $n$ and prefix complexity
$K(x) = k$. We identify the structure or regularity in $x$ that
is to be summarized with a set $S$ of which $x$ is a random
or typical member: given $S$ (or rather, an (implicit or explicit)
shortest program $S^*$ for $S$), $x$ cannot be described
significantly shorter than by its maximal length index in
$S$, that is, $K(x \mid S^*) \stackrel{+}{>} \log |S|$. Formally,

Definition III.2: Let $\beta \ge 0$ be an agreed upon, fixed,
constant. A finite binary string $x$ is a typical or random
element of a set $S$ of finite binary strings if $x \in S$ and

$$K(x \mid S^*) \ge \log |S| - \beta, \tag{III.1}$$

where $S^*$ is an implicit or explicit shortest program for $S$.
We will not indicate the dependence on $\beta$ explicitly, but
the constants in all our inequalities ($\stackrel{+}{<}$) will be allowed to
be functions of this $\beta$.

This definition requires a finite $S$. In fact, since
$K(x \mid S^*) \stackrel{+}{<} K(x)$, it limits the size of $S$ to $O(2^k)$, and a set $S$
(or rather, the shortest program $S^*$ from which $S$ can be computed)
is an algorithmic statistic for $x$ iff

$$K(x \mid S^*) \stackrel{+}{=} \log |S|. \tag{III.2}$$

Note that the notions of optimality and typicality are not
absolute but depend on fixing the constant implicit in the
$\stackrel{+}{=}$. Depending on whether $S^*$ is an implicit or explicit program,
our definition splits into implicit and explicit typicality.

Example III.3: Consider the set $S$ of binary strings of
length $n$ whose every odd position is 0. Let $x$ be an element
of this set in which the subsequence of bits in even positions
is an incompressible string. Then $S$ is explicitly as well as
implicitly typical for $x$. The set $\{x\}$ also has both these
properties. ◇
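For a small $n$ the set of this example and the index of an element in it can be listed directly. The sketch below is an illustration only: the choice $n = 8$, the sample string, lexicographic enumeration order, and the function names are all assumptions made here.

```python
from itertools import product
from typing import List


def model_set(n: int) -> List[str]:
    """All length-n binary strings whose odd positions (1-indexed) are '0'."""
    members = []
    for bits in product("01", repeat=n):
        s = "".join(bits)
        # 0-indexed positions 0, 2, 4, ... are the 1-indexed odd positions.
        if all(s[i] == "0" for i in range(0, n, 2)):
            members.append(s)
    return members


def index_in_set(x: str, members: List[str]) -> str:
    """Fixed-length binary index of x in the lexicographic listing of the set."""
    width = (len(members) - 1).bit_length()
    return format(members.index(x), "0{}b".format(width))


n = 8
S = model_set(n)
x = "01000101"
index = index_in_set(x, S)
```

Here $|S| = 2^{n/2}$, so the index takes $n/2$ bits and coincides with the bits of $x$ in even positions; when those bits are incompressible, the index is essentially the shortest description of $x$ given $S$, which is what typicality expresses.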

Remark III.4: It is not clear whether explicit typicality
implies implicit typicality. Section IV will show some examples
which are implicitly very non-typical but explicitly
at least nearly typical. ◇
There are two natural measures of suitability of such a
statistic. We might prefer either the simplest set, or the
largest set, as corresponding to the most likely structure
'explaining' $x$. The singleton set $\{x\}$, while certainly a
statistic for $x$, would indeed be considered a poor explanation.
Both measures relate to the optimality of a two-stage
description of $x$ using $S$:

$$K(x) \stackrel{+}{<} K(x, S) \stackrel{+}{=} K(S) + K(x \mid S^*) \stackrel{+}{<} K(S) + \log |S|, \tag{III.3}$$

where we rewrite $K(x, S)$ by (II.1). Here, $S$ can be understood
as either $S_{\text{impl}}$ or $S_{\text{expl}}$. Call a set $S$ (containing $x$)
for which

$$K(x) \stackrel{+}{=} K(S) + \log |S|, \tag{III.4}$$

optimal. Depending on whether $K(S)$ is understood as
$K(S_{\text{impl}})$ or $K(S_{\text{expl}})$, our definition splits into implicit and
explicit optimality. Mindful of our distinction between a
finite set $S$ and a program that describes $S$ in a required
representation format, we call a shortest program for an optimal
set with respect to $x$ an algorithmic sufficient statistic
for $x$. Furthermore, among optimal sets, there is a direct
trade-off between complexity and logsize, which together
sum to $\stackrel{+}{=} k$. Equality (III.4) is the algorithmic equivalent



dealing with the relation between the individual sufficient 
statistic and the individual data sample, in contrast to the 
probabilistic notion (|^). 

Example III.5: The following restricted model family illustrates
the difference between the algorithmic individual
notion of sufficient statistic and the probabilistic averaging
one. Foreshadowing the discussion in Section VII,
this example also illustrates the idea that the semantics
of the model class should be obtained by a restriction on
the family of allowable models, after which the (minimal)
sufficient statistic identifies the most appropriate model
in the allowable family and thus optimizes the parameters
in the selected model class. In the algorithmic setting
we use all subsets of $\{0, 1\}^n$ as models and the shortest
programs computing them from a given data sample
as the statistic. Suppose we have background information
constraining the family of models to the $n + 1$ finite
sets $S_s = \{x \in \{0, 1\}^n : x = x_1 \ldots x_n \ \&\ \sum_{i=1}^n x_i = s\}$
($0 \le s \le n$). Assume that our model family is the family
of Bernoulli distributions. Then, in the probabilistic sense
for every data sample $x = x_1 \ldots x_n$ there is only one natural
sufficient statistic: for $\sum_i x_i = s$ this is $T(x) = s$ with
the corresponding model $S_s$. In the algorithmic setting the
situation is more subtle. (In the following example we use
the complexities conditional on $n$.) For $x = x_1 \ldots x_n$ with
$\sum_i x_i = n/2$, taking $S_{n/2}$ as model yields $|S_{n/2}| = \binom{n}{n/2}$, and
therefore $\log |S_{n/2}| \stackrel{+}{=} n - \frac{1}{2} \log n$. The sum of $K(S_{n/2} \mid n) \stackrel{+}{=} 0$
and the logarithmic term gives $\stackrel{+}{=} n - \frac{1}{2} \log n$ for the right-hand
side of (III.4). But taking $x = 1010 \ldots 10$ yields
$K(x \mid n) \stackrel{+}{=} 0$ for the left-hand side. Thus, there is no
algorithmic sufficient statistic for the latter $x$ in this model
class, while every $x$ of length $n$ has a probabilistic sufficient
statistic in the model class. In fact, the restricted model
class has an algorithmic sufficient statistic for data samples
$x$ of length $n$ that have maximal complexity with respect
to the frequency of "1"s; the other data samples have no
algorithmic sufficient statistic in this model class. ◇
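The logsize estimate $\log \binom{n}{n/2} \stackrel{+}{=} n - \frac{1}{2} \log n$ used in this example is easy to check numerically. The sketch below (function names chosen here; `math.comb` requires Python 3.8 or later) computes the gap between the two sides for a few even $n$.

```python
import math


def log2_central_binomial(n: int) -> float:
    """log2 of C(n, n/2), the number of length-n strings with n/2 ones (n even)."""
    return math.log2(math.comb(n, n // 2))


def approximation(n: int) -> float:
    """The estimate n - (1/2) log2 n used in the text."""
    return n - 0.5 * math.log2(n)


# Gap between estimate and exact value; it stays bounded as n grows.
gaps = {n: approximation(n) - log2_central_binomial(n)
        for n in (16, 64, 256, 1024)}
```

By Stirling's formula the gap tends to $\frac{1}{2} \log_2 (\pi / 2) \approx 0.33$, a constant, which is exactly what the $\stackrel{+}{=}$ relation allows.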
Example III.6: It can be shown that the set $S$ of Example
III.3 is also optimal, and so is $\{x\}$. Typical sets form a
much wider class than optimal ones: $\{x, y\}$ is still typical
for $x$ but with most $y$, it will be too complex to be optimal
for $x$.

For a perhaps less artificial example, consider complexities
conditional on the length $n$ of strings. Let $y$ be a
random string of length $n$, let $S_y$ be the set of strings of
length $n$ which have 0's wherever $y$ has 0's, and let $x$ be
a random element of $S_y$. Then $x$ is a string random with
respect to the distribution in which 1's are chosen independently
with probability 0.25, so its complexity is much
less than $n$. The set $S_y$ is typical with respect to $x$ but is
too complex to be optimal, since its (explicit or implicit)
complexity conditional on $n$ is $n$. ◇

It follows that (programs for) optimal sets are statistics.
Equality (III.4) expresses the conditions on the algorithmic
individual relation between the data and the sufficient
statistic. Later we demonstrate that this relation
implies that the probabilistic optimality of mutual information
(I.1) holds for the algorithmic version in the expected
sense.

An algorithmic sufficient statistic $T(\cdot)$ is a sharper individual
notion than a probabilistic sufficient statistic. An
optimal set $S$ associated with $x$ (the shortest program computing
$S$ is the corresponding sufficient statistic associated
with $x$) is chosen such that $x$ is maximally random with
respect to it. That is, the information in $x$ is divided into a
relevant structure expressed by the set $S$, and the remaining
randomness with respect to that structure, expressed
by its index in $S$ of $\log |S|$ bits. The shortest program
for $S$ is itself alone an algorithmic definition of structure,
without a probabilistic interpretation.

One can also consider notions of near-typical and near-optimal
that arise from replacing the $\beta$ in (III.1) by some
slowly growing functions, such as $O(\log l(x))$ or $O(\log k)$.

In [17], [21], a function of $k$ and $x$ is defined as the lack
of typicality of $x$ in sets of complexity at most $k$, and they
then consider the minimum $k$ for which this function becomes
$\stackrel{+}{=} 0$ or very small. This is equivalent to our notion
of a typical set. See the discussion of this function in Section IV.
In §, 1, only optimal sets are considered, and
the one with the shortest program is identified as the algorithmic
minimal sufficient statistic of $x$. Formally, this
is the shortest program that computes a finite set $S$ such
that (III.4) holds.



C. Properties of Sufficient Statistic 

We start with a sequence of lemmas that will be used 
in the later theorems. Several of these lemmas have two 
versions: for implicit sets and for explicit sets. In these 
cases, $S$ will denote $S_{\text{impl}}$ or $S_{\text{expl}}$ respectively.

Below it is shown that the mutual information between 
every typical set and the data is not much less than 
K(K(x)), the complexity of the complexity K(x) of the 



data x. For optimal sets it is at least that, and for algo- 
rithmic minimal statistic it is equal to that. The number 
of elements of a typical set is determined by the following: 

Lemma III.7: Let $k = K(x)$. If a set $S$ is (implicitly or
explicitly) typical for $x$ then $I(x : S) \stackrel{+}{=} k - \log |S|$.

Proof: By definition $I(x : S) \stackrel{+}{=} K(x) - K(x \mid S^*)$
and by typicality $K(x \mid S^*) \stackrel{+}{=} \log |S|$. ■

Typicality, optimality, and minimal optimality successively
restrict the range of the cardinality (and complexity)
of a corresponding model for a data $x$. The above lemma
states that for (implicitly or explicitly) typical $S$ the cardinality
$|S| = \Theta(2^{k - I(x : S)})$. The next lemma asserts that for
implicitly typical $S$ the value $I(x : S)$ can fall below $K(k)$
by no more than an additive logarithmic term.

Lemma III.8: Let $k = K(x)$. If a set $S$ is (implicitly or
explicitly) typical for $x$ then $I(x : S) \stackrel{+}{>} K(k) - K(I(x : S))$
and $\log |S| \stackrel{+}{<} k - K(k) + K(I(x : S))$. (Here, $S$ is
understood as $S_{\text{impl}}$ or $S_{\text{expl}}$ respectively.)

Proof: Writing $k = K(x)$, since

$$k \stackrel{+}{=} K(k, x) \stackrel{+}{=} K(k) + K(x \mid k^*) \tag{III.5}$$

by (II.1), we have $I(x : S) \stackrel{+}{=} K(x) - K(x \mid S^*) \stackrel{+}{=} K(k) - [K(x \mid S^*) - K(x \mid k^*)]$.
Hence, it suffices to show
$K(x \mid S^*) - K(x \mid k^*) \stackrel{+}{<} K(I(x : S))$. Now, from an implicit
description $S^*$ we can find the value $\stackrel{+}{=} \log |S| \stackrel{+}{=} k - I(x : S)$.
To recover $k$ we only require an extra $K(I(x : S))$
bits apart from $S^*$. Therefore, $K(k \mid S^*) \stackrel{+}{<} K(I(x : S))$.
This reduces what we have to show to
$K(x \mid S^*) \stackrel{+}{<} K(x \mid k^*) + K(k \mid S^*)$, which is asserted by Theorem II.1.
■

The term $I(x : S)$ is at least $K(k) - 2 \log K(k)$ where
$k = K(x)$. For $x$ of length $n$ with $k \stackrel{+}{>} n$ and
$K(k) \stackrel{+}{>} l(k) \stackrel{+}{>} \log n$, this yields $I(x : S) \stackrel{+}{>} \log n - 2 \log \log n$.

If we further restrict typical sets to optimal sets then 
the possible number of elements in S is slightly restricted. 
First we show that implicit optimality of a set with respect 
to a data is equivalent to typicality with respect to the 
data combined with effective constructability (determina- 
tion) from the data. 

Lemma III.9: A set $S$ is (implicitly or explicitly) optimal
for $x$ iff it is typical and $K(S \mid x^*) \stackrel{+}{=} 0$.

Proof: A set $S$ is optimal iff (III.3) holds with equalities.
Rewriting $K(x, S) \stackrel{+}{=} K(x) + K(S \mid x^*)$, the first inequality
becomes an equality iff $K(S \mid x^*) \stackrel{+}{=} 0$, and the second
inequality becomes an equality iff $K(x \mid S^*) \stackrel{+}{=} \log |S|$
(that is, $S$ is a typical set). ■

Lemma III.10: Let $k = K(x)$. If a set $S$ is (implicitly
or explicitly) optimal for $x$, then $I(x : S) \stackrel{+}{=} K(S) \stackrel{+}{>} K(k)$
and $\log |S| \stackrel{+}{<} k - K(k)$.

Proof: If $S$ is optimal for $x$, then $k = K(x) \stackrel{+}{=} K(S) + K(x \mid S^*) \stackrel{+}{=} K(S) + \log |S|$.
From $S^*$ we can find both
$K(S) \stackrel{+}{=} l(S^*)$ and $\stackrel{+}{=} \log |S|$ and hence $k$, that is,
$K(k) \stackrel{+}{<} K(S)$. We have $I(x : S) \stackrel{+}{=} K(S) - K(S \mid x^*) \stackrel{+}{=} K(S)$
by (II.1), Lemma III.9, respectively. This proves the first
property. Substitution of $I(x : S) \stackrel{+}{>} K(k)$ in the expression
of Lemma III.7 proves the second property. ■

Fig. 1. Range of statistic on the straight line $I(x : S) \stackrel{+}{=} K(x) - \log |S|$.
(Recoverable labels: axis $I(x : S)$; marked values $K(k)$, $k - K(k)$, $k$; region "typical (initial constraint)".)



D. Implicit Minimal Sufficient Statistic 

A simplest implicitly optimal set (that is, of least complexity)
is an implicit algorithmic minimal sufficient statistic.
We demonstrate that $S^k = \{y : K(y) \le k\}$, the set of
all strings of complexity at most $k$, is such a set. First we
establish the cardinality of $S^k$:

Lemma III.11: $\log |S^k| \stackrel{+}{=} k - K(k)$.

Proof: The lower bound is easiest. Denote by $k^*$ of
length $K(k)$ a shortest program for $k$. Every string $s$ of
length $k - K(k) - c$ can be described in a self-delimiting
manner by prefixing it with $k^* c^*$, hence $K(s) \stackrel{+}{<} k - c + 2 \log c$.
For a large enough constant $c$, we have $K(s) \le k$
and hence there are $\Omega(2^{k - K(k)})$ strings that are in $S^k$.
For the upper bound: by (III.5), all $x \in S^k$ satisfy
$K(x \mid k^*) \stackrel{+}{<} k - K(k)$, and there can only be $O(2^{k - K(k)})$ of them.
■

From the definition of $S^k$ it follows that it is defined by $k$
alone, and it is the same set that is optimal for all objects
of the same complexity $k$.

Theorem III.12: The set $S^k$ is implicitly optimal for every
$x$ with $K(x) = k$. Also, we have $K(S^k) \stackrel{+}{=} K(k)$.

Proof: From $k^*$ we can compute both $k$ and
$k - l(k^*) = k - K(k)$ and recursively enumerate $S^k$. Since also
$\log |S^k| \stackrel{+}{=} k - K(k)$ (Lemma III.11), the string $k^*$ plus
a fixed program is an implicit description of $S^k$ so that
$K(k) \stackrel{+}{>} K(S^k)$. Hence, $K(x) \stackrel{+}{>} K(S^k) + \log |S^k|$ and since
$K(x)$ is the shortest description by definition, equality ($\stackrel{+}{=}$)
holds. That is, $S^k$ is optimal for $x$. By Lemma III.10,
$K(S^k) \stackrel{+}{>} K(k)$, which together with the reverse inequality
above yields $K(S^k) \stackrel{+}{=} K(k)$, which shows the theorem. ■



Again using Lemma III.10 shows that the optimal set
$S^k$ has least complexity among all optimal sets for $x$, and
therefore:

Corollary III.13: The set $S^k$ is an implicit algorithmic
minimal sufficient statistic for every $x$ with $K(x) = k$.

All algorithmic minimal sufficient statistics $S$ for $x$ have
$K(S) \stackrel{+}{=} K(k)$, and therefore there are $O(2^{K(k)})$ of them.
At least one such statistic ($S^k$) is associated with every
one of the $O(2^k)$ strings $x$ of complexity $k$. Thus, while
the idea of the algorithmic minimal sufficient statistic is
intuitively appealing, its unrestricted use doesn't seem to
uncover most relevant aspects of reality. The only relevant
structure in the data with respect to an algorithmic minimal
sufficient statistic is the Kolmogorov complexity. To
give an example, an initial segment of $3.1415\ldots$ of length $n$
of complexity $\log n + O(1)$ shares the same algorithmic sufficient
statistic with many (most?) binary strings of length
$\log n + O(1)$.

E. Explicit Minimal Sufficient Statistic 

Let us now consider representations of finite sets that are 
explicit in the sense that we can compute the cardinality 
of the set from the representation. 

E.1 Explicit Minimal Sufficient Statistic: Particular Cases

Example III.14: The description program enumerates all
the elements of the set and halts. Then a set like
$S^k = \{y : K(y) \le k\}$ has complexity $\stackrel{+}{=} k$ [18]: Given the program we
can find an element not in $S^k$, which element by definition
has complexity $> k$. Given $S^k$ we can find this element
and hence $S^k$ has complexity $\stackrel{+}{>} k$. Let

$$N^k = |S^k|,$$

then by Lemma III.11 $\log N^k \stackrel{+}{=} k - K(k)$. We can list $S^k$
given $k^*$ and $N^k$, which shows $K(S^k) \stackrel{+}{<} k$. ◇
Example III.15: One way of implementing explicit finite
representations is to provide an explicit generation time for
the enumeration process. If we can generate $S^k$ in time $t$
recursively using $k$, then the previous argument shows that
the complexity of every number $t' \ge t$ satisfies $K(t', k) \stackrel{+}{>} k$
so that $K(t') \stackrel{+}{>} K(t' \mid k^*) \stackrel{+}{>} k - K(k)$ by (II.1). This
means that $t$ is a huge time which as a function of $k$ rises
faster than every computable function. This argument also
shows that explicit enumerative descriptions of sets $S$ containing
$x$ by an enumerative process $p$ plus a limit on the
computation time $t$ may take only $l(p) + K(t)$ bits (with
$K(t) \stackrel{+}{<} \log t + 2 \log \log t$) but $\log t$ unfortunately becomes
noncomputably large! ◇
Example III.16: Another way is to indicate the element
of $S^k$ that requires the longest generation time as part of
the dovetailing process, for example by its index $i$ in the
enumeration, $i \stackrel{*}{<} 2^{k - K(k)}$. Then, $K(i \mid k) \stackrel{+}{<} k - K(k)$. In
fact, since a shortest program $p$ for the $i$th element together
with $k$ allows us to generate $S^k$ explicitly, and above we
have seen that the explicit description format yields $K(S^k) \stackrel{+}{=} k$,
we find we have $K(p, k) \stackrel{+}{>} k$ and hence $K(p) \stackrel{+}{>} k - K(k)$. ◇



In other cases the generation time is simply recursive in
the input: $S_n = \{y : l(y) = n\}$ so that $K(S_n) \stackrel{+}{=} K(n) \stackrel{+}{<} \log n + 2 \log \log n$.
That is, this sufficient statistic for a
random string $x$ with $K(x) \stackrel{+}{=} n + K(n)$ has complexity
$K(n)$ both for implicit descriptions and explicit descriptions:
differences in complexity arise only for nonrandom
strings (but not too nonrandom, for $K(x) \stackrel{+}{=} 0$ these differences
vanish again).

Lemma III.17: $S_n$ is an example of a minimal sufficient
statistic, both explicit and implicit, for all $x$ with
$K(x) \stackrel{+}{=} n + K(n)$.

Proof: The set $S_n$ is a sufficient statistic for $x$
since $K(x) \stackrel{+}{=} K(S_n) + \log |S_n|$. It is minimal since by
Lemma III.10 we must have $K(S) \stackrel{+}{>} K(K(x))$ for implicit,
and hence for explicit, sufficient statistics. It is evident that
$S_n$ is explicit: $|S_n| = 2^n$. ■
It turns out that some strings cannot thus be explicitly
represented parsimoniously with low-complexity models
(so that one necessarily has bad high-complexity models
like $S^k$ above). For explicit representations, [17] has
demonstrated the existence of a class of strings called
non-stochastic that don't have efficient two-part representations
with $K(x) \stackrel{+}{=} K(S) + \log |S|$ ($x \in S$) with $K(S)$
significantly less than $K(x)$. This result does not yet enable us
to exhibit an explicit minimal sufficient statistic for such a
string. But in Section IV we improve these results to the
best possible, simultaneously establishing explicit minimal
sufficient statistics for the subject ultimate non-stochastic
strings:

Lemma III.18: For every length $n$, there exist strings $x$
of length $n$ with $K(x \mid n) \stackrel{+}{=} n$ for which $\{x\}$ is an explicit
minimal sufficient statistic. The proof is deferred to the
end of Section IV.



E.2 Explicit Minimal Near-Sufficient Statistic: General 
Case 

Again, consider the special set $S^k = \{y : K(y) \le k\}$. As
we have seen earlier, $S^k$ itself cannot be explicitly optimal
for $x$ since $K(S^k) \stackrel{+}{=} k$ and $\log N^k \stackrel{+}{=} k - K(k)$, and therefore
$K(S^k) + \log N^k \stackrel{+}{=} 2k - K(k)$, which considerably exceeds
$k$. However, it turns out that a closely related set ($S^k_{m_x}$
below) is explicitly near-optimal. Let $I^k_y$ denote the index
of $y$ in the standard enumeration of $S^k$, where all indexes
are padded to the same length $\stackrel{+}{=} k - K(k)$ with 0's in front.
For $K(x) = k$, let $m_x$ denote the longest joint prefix of $I^k_x$
and $N^k$, and let

$$I^k_x = m_x 0 i_x, \qquad N^k = m_x 1 n_x.$$

Lemma III.19: For $K(x) = k$, the set
$S^k_{m_x} = \{y \in S^k : m_x 0 \text{ a prefix of } I^k_y\}$ satisfies

$$\log |S^k_{m_x}| \stackrel{+}{=} k - K(k) - l(m_x),$$

$$K(S^k_{m_x}) \stackrel{+}{<} K(k) + K(m_x) \stackrel{+}{<} K(k) + l(m_x) + K(l(m_x)).$$

Hence it is explicitly near-optimal for $x$ (up to an additive
$K(l(m_x)) \stackrel{+}{<} K(k) \stackrel{+}{<} \log k + 2 \log\log k$ term).

Proof: We can describe $x$ by $k^* m_x^* i_x$ where $m_x 0 i_x$
is the index of $x$ in the enumeration of $S^k$. Moreover,
$k^* m_x^*$ explicitly describes the set $S^k_{m_x}$. Namely, using $k$
we can recursively enumerate $S^k$. At some point the first
string $z \in S^k_{m_x}$ is enumerated (index $I^k_z = m_x 0 0 \ldots 0$).
By assumption $I^k_x = m_x 0 \ldots$ and $N^k = m_x 1 \ldots$. Therefore,
in the enumeration of $S^k$ eventually a string $u$ with
$I^k_u = m_x 0 1 1 \ldots 1$ occurs, which is the last string in the
enumeration of $S^k_{m_x}$. Thus, the size of $S^k_{m_x}$ is precisely
$2^{l(N^k) - l(m_x)}$, where $l(N^k) - l(m_x) \stackrel{+}{=} l(n_x) \stackrel{+}{=} \log |S^k_{m_x}|$, and
$S^k_{m_x}$ is explicitly described by $k^* m_x^*$. Since $l(k^* m_x 0 i_x) \stackrel{+}{=} k$
and $\log |S^k_{m_x}| \stackrel{+}{=} k - K(k) - l(m_x)$ we have

$$K(S^k_{m_x}) + \log |S^k_{m_x}| \stackrel{+}{<} K(k) + K(m_x) + k - K(k) - l(m_x)$$
$$= k + K(m_x) - l(m_x) \stackrel{+}{<} k + K(l(m_x)).$$

This shows $S^k_{m_x}$ is explicitly near-optimal for $x$ (up to an
additive logarithmic term). ■
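The splitting of the padded index at the joint prefix can be illustrated on toy numbers. The following is only a sketch under stated assumptions: the real $S^k$ cannot be listed by any program that is guaranteed to halt, so `index`, `count`, and `width` are hypothetical stand-ins for the index of $x$, for $N^k$, and for the padded length, and indices are taken to start at 0 so that the index is always smaller than the count. The function names are illustrative and do not come from the paper.

```python
def longest_joint_prefix(a: str, b: str) -> str:
    """Return the longest common prefix of two bit strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def split_index(index: int, count: int, width: int):
    """Split a padded index I and a padded count N at their joint prefix.

    Returns (m, i, n_rest) with I = m 0 i and N = m 1 n_rest.
    Assumes 0 <= index < count < 2**width (toy stand-ins, see lead-in).
    """
    I = format(index, f"0{width}b")
    N = format(count, f"0{width}b")
    m = longest_joint_prefix(I, N)
    # Because I < N as binary numbers of equal width, the first
    # differing bit is 0 in I and 1 in N.
    assert I[len(m)] == "0" and N[len(m)] == "1"
    return m, I[len(m) + 1:], N[len(m) + 1:]


def block_size(m: str, width: int) -> int:
    """Number of padded indices of this width that start with m followed by 0."""
    return 2 ** (width - len(m) - 1)
```

For example, with `width=5`, `count=22` (binary 10110) and `index=19` (binary 10011), `split_index` returns `("10", "11", "10")`, and `block_size("10", 5)` is 4: the block consists of the indices 16 to 19, all of which lie below the count. A longer joint prefix gives a smaller block, matching $\log |S^k_{m_x}| \stackrel{+}{=} k - K(k) - l(m_x)$.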
Lemma III.20: Every explicit optimal set $S \subseteq S^k$
containing $x$ satisfies

$$K(S) \stackrel{+}{>} K(k) + l(m_x) - K(l(m_x)).$$

Proof: If $S \subseteq S^k$ is explicitly optimal for $x$, then
we can find $k$ from $S^*$ (as in the proof of Lemma III.10),
and given $k$ and $S$ we find $K(k)$ as in Theorem II.1.
Hence, given $S^*$, we can enumerate $S^k$ and determine the
maximal index $I^k_y$ of a $y \in S$. Since also $x \in S$, the
numbers $I^k_y, I^k_x, N^k$ have a maximal common prefix $m_x$.
Write $I^k_x = m_x 0 i_x$ with $l(i_x) \stackrel{+}{=} k - K(k) - l(m_x)$ by
Lemma III.10. Given $l(m_x)$ we can determine $m_x$ from $I^k_y$.
Hence, from $S$, $l(m_x)$, and $i_x$ we can reconstruct $x$. That
is, $K(S) + K(l(m_x)) + l(I^k_x) - l(m_x) \stackrel{+}{>} k$, which yields the
lemma. ■



Lemmas III.19 and III.20 demonstrate:

Theorem III.21: The set $S^k_{m_x}$ is an explicit algorithmic
minimal near-sufficient statistic for $x$ among subsets of $S^k$
in the following sense:

$$|K(S^k_{m_x}) - K(k) - l(m_x)| \stackrel{+}{<} K(l(m_x)),$$

$$\log |S^k_{m_x}| \stackrel{+}{=} k - K(k) - l(m_x).$$

Hence $K(S^k_{m_x}) + \log |S^k_{m_x}| \stackrel{+}{=} k \pm K(l(m_x))$. Note,
$K(l(m_x)) \stackrel{+}{<} \log k + 2 \log\log k$.



E.3 Almost Always "Sufficient" 

We have not completely succeeded in giving a concrete
algorithmic explicit minimal sufficient statistic. However,
we can show that $S^k_{m_x}$ is almost always minimal sufficient.

The complexity and cardinality of $S^k_{m_x}$ depend on $l(m_x)$,
which will in turn depend on $x$. One extreme is $l(m_x) \stackrel{+}{=} 0$,
which happens for the majority of $x$'s with $K(x) = k$
(for example, the first 99.9% in the enumeration order). For
those $x$'s we can replace "near-sufficient" by "sufficient"
in Theorem III.21. Can the other extreme be reached?
This is the case when $x$ is enumerated close to the end of
the enumeration of $S^k$. For example, this happens for the
"non-stochastic" objects of which the existence was proven
by Shen [17] (see Section IV). For such objects, $l(m_x)$
grows to $\stackrel{+}{=} k - K(k)$ and the complexity of $S^k_{m_x}$ rises to
$\stackrel{+}{=} k$ while $\log |S^k_{m_x}|$ drops to $\stackrel{+}{=} 0$. That is, the explicit
algorithmic minimal sufficient statistic for $x$ is essentially
$x$ itself. For those $x$'s we can also replace "near-sufficient"
with "sufficient" in Theorem III.21. Generally: for the
overwhelming majority of data $x$ of complexity $k$ the set
$S^k_{m_x}$ is an explicit algorithmic minimal sufficient statistic
among subsets of $S^k$ (since $l(m_x) \stackrel{+}{=} 0$).

The following discussion will put what was said above
into a more illuminating context. Let

$$X(r) = \{x : l(m_x) \ge r\}.$$

The set $X(r)$ is infinite, but we can break it into slices and
bound each slice separately.

Lemma III.22:

$$|X(r) \cap (S^k \setminus S^{k-1})| \le 2^{-r+1} |S^k|.$$

Proof: For every $x$ in the set defined by the left-hand side
of the inequality, we have $l(m_x) \ge r$, and the length
of the continuation of $m_x$ to the total padded index of $x$ is
$\le \lceil \log |S^k| \rceil - r \le \log |S^k| - r + 1$. Moreover, all these
indices share the same first $r$ bits, so there are at most
$2^{\log |S^k| - r + 1} = 2^{-r+1} |S^k|$ of them. This proves the lemma. ■



Theorem III.23:

$$\sum_{x \in X(r)} 2^{-K(x)} \le 2^{-r+2}.$$

Proof: Let us prove first

$$\sum_{k \ge 0} 2^{-k} |S^k| \le 2. \qquad \text{(III.6)}$$

By the Kraft inequality, we have, with $t_k = |S^k \setminus S^{k-1}|$,

$$\sum_{k \ge 0} 2^{-k} t_k \le 1,$$

since $S^k$ is in 1-1 correspondence with the prefix programs
of length $\le k$. Hence

$$\sum_{k \ge 0} 2^{-k} |S^k| = \sum_{k \ge 0} 2^{-k} \sum_{i=0}^{k} t_i = \sum_{i \ge 0} t_i \sum_{k \ge i} 2^{-k} = \sum_{i \ge 0} t_i 2^{-i+1} \le 2.$$

For the statement of the theorem, we have

$$\sum_{x \in X(r)} 2^{-K(x)} = \sum_{k \ge 0} 2^{-k} |X(r) \cap (S^k \setminus S^{k-1})| \le 2^{-r+1} \sum_{k \ge 0} 2^{-k} |S^k| \le 2^{-r+2},$$

where in the last inequality we used (III.6). ■
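The exchange of summation order behind (III.6) can be checked on a finite toy example. This is a sketch under stated assumptions: the list `t` is hypothetical data standing in for the counts $t_k = |S^k \setminus S^{k-1}|$ (which are not computable in general), truncated at a finite $k$; the only property used is the Kraft inequality $\sum_k 2^{-k} t_k \le 1$. The function name is illustrative.

```python
from fractions import Fraction


def weighted_sums(t):
    """Return (sum_k 2^-k t_k, sum_k 2^-k |S^k|) for toy counts t.

    t[k] plays the role of the number of strings of complexity exactly k,
    so |S^k| = t[0] + ... + t[k]. Exact rational arithmetic is used.
    """
    kraft = sum(Fraction(tk, 2 ** k) for k, tk in enumerate(t))
    sizes = []
    running = 0
    for tk in t:
        running += tk
        sizes.append(running)  # cumulative count, the toy |S^k|
    weighted = sum(Fraction(s, 2 ** k) for k, s in enumerate(sizes))
    return kraft, weighted


# Hypothetical counts chosen so that the Kraft sum is exactly 1.
kraft, weighted = weighted_sums([0, 1, 1, 1, 2])
```

With these counts `kraft` equals 1 and `weighted` equals 27/16, which is below the bound $2 \cdot \sum_k 2^{-k} t_k = 2$. For a truncated list the weighted sum is at most twice the Kraft sum, because the inner geometric series $\sum_{k \ge i} 2^{-k} = 2^{-i+1}$ is only partially summed.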

This theorem can be interpreted as follows (we rely here
on a discussion, unconnected with the present topic, about
universal probability with L. A. Levin in 1973). The above
theorem states $\sum_{x \in X(r)} \mathbf{m}(x) \le 2^{-r+2}$. By the multiplicative
dominating property of $\mathbf{m}(x)$ with respect to every
lower semicomputable semimeasure, it follows that for every
computable measure $\nu$, we have $\sum_{x \in X(r)} \nu(x) \stackrel{*}{<} 2^{-r}$.
Thus, the set of objects $x$ for which $l(m_x)$ is large has small
probability with respect to every computable probability
distribution.

To shed light on the exceptional nature of strings $x$ with
large $l(m_x)$ from yet another direction, let $\chi$ be the infinite
binary sequence, the halting sequence, which constitutes
the characteristic function of the halting problem for our
universal Turing machine: the $i$th bit of $\chi$ is 1 if the
machine halts on the $i$th program, and is 0 otherwise. The
expression

$$I(\chi : x) = K(x) - K(x \mid \chi)$$

shows the amount of information in the halting sequence
about the string $x$. (For an infinite sequence $\eta$, we go back
formally to the definition $I(\eta : x) = K(x) - K(x \mid \eta)$ of [10],
since introducing a notion of $\eta^*$ in place of $\eta$ here has not
been shown yet to bring any benefits.) We have

$$\sum_x \mathbf{m}(x) 2^{I(\chi : x)} = \sum_x 2^{-K(x \mid \chi)} \le 1.$$

Therefore, if we introduce a new quantity $X'(r)$ related to
$X(r)$ defined by

$$X'(r) = \{x : I(\chi : x) > r\},$$

then by Markov's inequality (every $x \in X'(r)$ contributes a
factor $2^{I(\chi : x)} > 2^r$ to the sum above),

$$\sum_{x \in X'(r)} \mathbf{m}(x) < 2^{-r}.$$

That is, the universal probability of $X'(r)$ is small. This
is a new reason for $X(r)$ to be small, as is shown in the
following theorem.

Theorem III.24: We have

$$I(\chi : x) \stackrel{+}{>} l(m_x) - 2 \log l(m_x),$$

and (essentially equivalently) $X(r) \subseteq X'(r - 2 \log r)$.

Remark III.25: The first item in the theorem implies: If
$l(m_x) \ge r$, then $I(\chi : x) \stackrel{+}{>} r - 2 \log r$. This in its turn
implies the second item $X(r) \subseteq X'(r - 2 \log r)$. Similarly,
the second item essentially implies the first item. Thus,
a string for which the explicit minimal sufficient statistic
has complexity much larger than $K(k)$ (that is, $l(m_x)$ is
large) is exotic in the sense that it belongs to the kind of
strings about which the halting sequence contains much
information and vice versa: $I(\chi : x)$ is large. ◊
Proof: When we talk about complexity with $\chi$ in the
condition, we use a Turing machine with $\chi$ as an "oracle".
With the help of $\chi$, we can compute $m_x$, and so we can
define the following new semicomputable (relative to $\chi$)
function with $c = 6/\pi^2$:

$$\nu(x \mid \chi) = c \, \mathbf{m}(x) 2^{l(m_x)} / l(m_x)^2.$$

We have, using Theorem III.23 and defining $Y(r) = X(r) \setminus X(r+1)$
so that $l(m_x) = r$ for $x \in Y(r)$:

$$\sum_{x \in Y(r)} \nu(x \mid \chi) = c r^{-2} 2^r \sum_{x \in Y(r)} \mathbf{m}(x) \le c r^{-2} 2^r 2^{-r+2} \le 4 c r^{-2}.$$

Summing over $r$ gives $\sum_x \nu(x \mid \chi) \le 4$. The theorem that
$\mathbf{m}(x) = 2^{-K(x)}$ is maximal within a multiplicative constant
among semicomputable semimeasures is also true relative
to oracles. Since we have established that $\nu(x \mid \chi)/4$ is a
semicomputable semimeasure, we have $\mathbf{m}(x \mid \chi) \stackrel{*}{>} \nu(x \mid \chi)$,
or equivalently,

$$K(x \mid \chi) \stackrel{+}{<} -\log \nu(x \mid \chi) \stackrel{+}{=} K(x) - l(m_x) + 2 \log l(m_x),$$

which proves the theorem. ■
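The last step of the proof compresses two small computations. Written out as a sketch (using only the definitions above, with $\mathbf{m}(x) = 2^{-K(x)}$ and the constant $c = 6/\pi^2$ absorbed into the additive $O(1)$ term), they read:

```latex
\begin{align*}
-\log \nu(x \mid \chi)
  &= -\log c \;-\; \log \mathbf{m}(x) \;-\; l(m_x) \;+\; 2 \log l(m_x) \\
  &\stackrel{+}{=} K(x) - l(m_x) + 2 \log l(m_x), \\[4pt]
I(\chi : x)
  &= K(x) - K(x \mid \chi) \\
  &\stackrel{+}{>} K(x) - \bigl( K(x) - l(m_x) + 2 \log l(m_x) \bigr) \\
  &= l(m_x) - 2 \log l(m_x).
\end{align*}
```

The choice $c = 6/\pi^2$ normalizes the series $\sum_{r \ge 1} r^{-2} = \pi^2/6$, which is what turns the per-slice bound $4 c r^{-2}$ into the total bound 4.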

IV. Non-Stochastic Objects 

In this section, whenever we talk about a description
of a finite set $S$ we mean an explicit description. This
establishes the precise meaning of $K(S)$, $K(\cdot \mid S)$,
$\mathbf{m}(S) = 2^{-K(S)}$, and $\mathbf{m}(\cdot \mid S) = 2^{-K(\cdot \mid S)}$, and so forth.

Every data sample consisting of a finite string $x$ has a
sufficient statistic in the form of the singleton set $\{x\}$. Such
a sufficient statistic is not very enlightening since it simply
replicates the data and has equal complexity with $x$. Thus,
one is interested in the minimal sufficient statistic that
represents the regularity (the meaningful information) in the
data and leaves out the accidental features. This raises the
question whether every $x$ has a minimal sufficient statistic
that is significantly less complex than $x$ itself. At a
Tallinn conference in 1973 Kolmogorov (according to p7| , 
[|}) raised the question whether there are objects x that 
have no minimal sufficient statistic of relatively
small complexity. In other words, he inquired into the
existence of objects that are not in general position (random
with respect to) any finite set of small enough complexity,
that is, "absolutely non-random" objects. Clearly, such
objects $x$ have neither minimal nor maximal complexity: if
they have minimal complexity then the singleton set $\{x\}$
is a minimal sufficient statistic of small complexity, and if
$x \in \{0,1\}^n$ is completely incompressible (that is, it is
individually random and has no meaningful information), then
the uninformative universe $\{0,1\}^n$ is the minimal sufficient
statistic of small complexity. To analyze the question
better we need the technical notion of randomness deficiency.

Define the randomness deficiency of an object $x$ with
respect to a finite set $S$ containing it as the amount by
which the complexity of $x$ as an element of $S$ falls short of
the maximal possible complexity of an element in $S$ when
$S$ is known explicitly (say, as a list):

$$\delta_S(x) = \log |S| - K(x \mid S). \qquad \text{(IV.1)}$$

The meaning of this function is clear: most elements of
$S$ have complexity near $\log |S|$, so this difference measures
the amount of compressibility in $x$ compared to the generic,
typical, random elements of $S$. This is a generalization of
the sufficiency notion in that it measures the discrepancy
with typicality and hence sufficiency: if a set $S$ is a
sufficient statistic for $x$ then $\delta_S(x) \stackrel{+}{=} 0$.

We now continue the discussion of Kolmogorov's question.
Shen [17] gave a first answer by establishing the
existence of absolutely non-random objects $x$ of length $n$,
having randomness deficiency at least $n - 2k - O(\log k)$
with respect to every finite set $S$ of complexity $K(S) < k$
that contains $x$. Moreover, since the set $\{x\}$ has complexity
$K(x)$ and the randomness deficiency of $x$ with respect
to this singleton set is $\stackrel{+}{=} 0$, it follows by choice of $k = K(x)$
that the complexity $K(x)$ is at least $n/2 - O(\log n)$.

Here we sharpen this result: We establish the existence
of absolutely non-random objects $x$ of length $n$, having
randomness deficiency at least $n - k$ with respect to every finite
set $S$ of complexity $K(S \mid n) < k$ that contains $x$. Clearly,
this is best possible since $x$ has randomness deficiency of
at least $n - K(S \mid n)$ with every finite set $S$ containing
$x$; in particular, with complexity $K(S \mid n)$ more than a
fixed constant below $n$ the randomness deficiency exceeds
that fixed constant. That is, every sufficient statistic for $x$
has complexity at least $n$. But if we choose $S = \{x\}$ then
$K(S \mid n) \stackrel{+}{=} K(x \mid n) \stackrel{+}{<} n$, and, moreover, the randomness
deficiency of $x$ with respect to $S$ is $n - K(S \mid n) \stackrel{+}{=} 0$.
Together this shows that the absolutely non-random objects $x$
of length $n$ of which we established the existence have
complexity $K(x \mid n) \stackrel{+}{=} n$, and moreover, they have significant
randomness deficiency with respect to every set $S$ containing
them that has complexity significantly below their own
complexity $n$.

A. Kolmogorov Structure Function

We first consider the relation between the minimal
unavoidable randomness deficiency of $x$ with respect to a set
$S$ containing it, when the complexity of $S$ is upper bounded
by $\alpha$. These functional relations are known as Kolmogorov
structure functions. Kolmogorov proposed a variant of the
function

$$h_x(\alpha) = \min_S \{ \log |S| : x \in S,\ K(S) \le \alpha \}, \qquad \text{(IV.2)}$$

where $S \subseteq \{0,1\}^*$ is a finite set containing $x$, the
contemplated model for $x$, and $\alpha$ is a nonnegative integer value
bounding the complexity of the contemplated $S$'s. He did
not specify what is meant by $K(S)$ but it was noticed
immediately, as the paper [18] points out, that the behavior of
$h_x(\alpha)$ is rather trivial if $K(S)$ is taken to be the complexity
of a program that lists $S$ without necessarily halting.
Section III-D elaborates this point. So, the present section
refers to explicit descriptions only.

It is easy to see that for every increment $d$ we have

$$h_x(\alpha + d) \le h_x(\alpha) - d + O(\log d),$$

provided the right-hand side is non-negative, and 0 otherwise.
Namely, once we have an optimal set $S_\alpha$ we can
subdivide it in any standard way into $2^d$ parts and take as $S_{\alpha+d}$
the part containing $x$. Also, $h_x(\alpha) = 0$ implies $\alpha \stackrel{+}{>} K(x)$,
and, since the choice of $S = \{x\}$ generally implies only
$\alpha \stackrel{+}{<} K(x)$ is meaningful, we can conclude $\alpha \stackrel{+}{=} K(x)$.
Therefore it seems better advised to consider the function



$$h_x(\alpha) + \alpha - K(x) = \min_S \{ \log |S| - (K(x) - \alpha) : x \in S,\ K(S) \le \alpha \}$$

rather than (IV.2). For technical reasons related to the
later analysis, we introduce the following variant of
randomness deficiency (IV.1):

$$\delta^*_S(x) = \log |S| - K(x \mid S, K(S)).$$

The function $h_x(\alpha) + \alpha - K(x)$ seems related to a
function of more intuitive appeal, namely $\beta_x(\alpha)$ measuring the
minimal unavoidable randomness deficiency of $x$ with
respect to every finite set $S$, that contains it, of complexity
$K(S) \le \alpha$. Formally, we define

$$\beta_x(\alpha) = \min_S \{ \delta_S(x) : K(S) \le \alpha \},$$

and its variant

$$\beta^*_x(\alpha) = \min_S \{ \delta^*_S(x) : K(S) \le \alpha \},$$

defined in terms of $\delta^*_S$. Note that
$\beta_x(K(x)) \stackrel{+}{=} \beta^*_x(K(x)) \stackrel{+}{=} 0$. These $\beta$-functions are related
to, but different from, the $\beta$ in (I.4).

To compare $h$ and $\beta$, let us confine ourselves to binary
strings of length $n$. We will put $n$ into the condition of all
complexities.

Lemma IV.1: $\beta^*_x(\alpha \mid n) \stackrel{+}{<} h_x(\alpha \mid n) + \alpha - K(x \mid n)$.

Proof: Let $S \ni x$ be a set with $K(S \mid n) \le \alpha$ and
assume $h_x(\alpha \mid n) = \log |S|$. Tacitly understanding $n$ in the
conditions, and using the additivity property (II.1),

$$K(x) - \alpha \le K(x) - K(S) \stackrel{+}{<} K(x, S) - K(S) \stackrel{+}{=} K(x \mid S, K(S)).$$

Therefore

$$h_x(\alpha) + \alpha - K(x) = \log |S| - (K(x) - \alpha) \stackrel{+}{>} \log |S| - K(x \mid S, K(S)) \ge \beta^*_x(\alpha).$$ ■

It would be nice to have an inequality also in the other
direction, but we do not know currently what is the best
that can be said.

B. Sharp Bound on Non-Stochastic Objects

We are now able to formally express the notion of
non-stochastic objects using the Kolmogorov structure
functions $\beta_x(\alpha)$, $\beta^*_x(\alpha)$. For every given $k < n$, Shen
constructed in [17] a binary string $x$ of length $n$ with $K(x) \le k$
and $\beta_x(k - O(1)) > n - 2k - O(\log k)$. Let $x$ be one of
the non-stochastic objects of which the existence is
established. Substituting $k \stackrel{+}{=} K(x)$ we can contemplate the
set $S = \{x\}$ with complexity $K(S) \stackrel{+}{=} k$, and $x$ has
randomness deficiency $\stackrel{+}{=} 0$ with respect to $S$. This yields
$0 \stackrel{+}{=} \beta_x(K(x)) \stackrel{+}{>} n - 2K(x) - O(\log K(x))$. Since it
generally holds that these non-stochastic objects have
complexity $K(x) \stackrel{+}{>} n/2 - O(\log n)$, they are not random, typical, or
in general position with respect to every set $S$ containing
them with complexity $K(S) \not\stackrel{+}{>} n/2 - O(\log n)$, but they are
random, typical, or in general position only for sets $S$ with
complexity $K(S)$ sufficiently exceeding $n/2 - O(\log n)$, like
$S = \{x\}$.

Here, we improve on this result, replacing $n - 2k - O(\log k)$
with $n - k$ and using $\beta^*$ to avoid logarithmic
terms. This is the best possible, since by choosing
$S = \{0,1\}^n$ we find $\log |S| - K(x \mid S, K(S)) \stackrel{+}{=} n - k$, and
hence $\beta^*_x(c) \stackrel{+}{<} n - k$ for some constant $c$, which implies
$\beta^*_x(\alpha) \le \beta^*_x(c) \stackrel{+}{<} n - k$ for every $\alpha > c$.

Theorem IV.2: There are constants $c_1, c_2$ such that for
any given $k < n$ there is a binary string $x$ of length $n$
with $K(x \mid n) \le k$ such that for all $\alpha < k - c_1$ we have

$$\beta^*_x(\alpha \mid n) > n - k - c_2.$$

In the terminology of (I.4), the theorem states that there
are constants $c_1, c_2$ such that for every $k < n$ there exists
a string $x$ of length $n$ of complexity $K(x \mid n) \le k$ that is
not $(k - c_1, n - k - c_2)$-stochastic.

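Proof: Denote the conditional universal probability
as $\mathbf{m}(S \mid n) = 2^{-K(S \mid n)}$. We write "$S \ni x$" to indicate sets
$S$ that satisfy $x \in S$. For every $n$, let us define a function
over all strings $x$ of length $n$ as follows:

$$\nu^{\le i}(x \mid n) = \sum_{S \ni x,\ K(S \mid n) \le i} \frac{\mathbf{m}(S \mid n)}{|S|}. \qquad \text{(IV.3)}$$

The following lemma shows that this function of $x$ is a
semimeasure.

Lemma IV.3: We have

$$\sum_x \nu^{\le i}(x \mid n) \le 1. \qquad \text{(IV.4)}$$

Proof: We have

$$\sum_x \nu^{\le i}(x \mid n) \le \sum_x \sum_{S \ni x} \frac{\mathbf{m}(S \mid n)}{|S|} = \sum_S \sum_{x \in S} \frac{\mathbf{m}(S \mid n)}{|S|} = \sum_S \mathbf{m}(S \mid n) \le 1.$$ ■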
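The double-counting step in this proof can be replayed on a small finite example. The sketch below assumes a hypothetical family of sets with hand-picked weights standing in for $\mathbf{m}(S \mid n)$ (the true universal probability is not computable); the only property used is that the weights sum to at most 1. The names `nu`, `STRINGS`, and `FAMILY` are illustrative.

```python
from fractions import Fraction


def nu(x, weighted_sets):
    """Sum of weight / |S| over all listed sets S that contain x."""
    return sum(w / len(S) for S, w in weighted_sets if x in S)


# All strings of length n = 2.
STRINGS = ["00", "01", "10", "11"]

# Hypothetical family: (set, weight) pairs; weights play the role of m(S | n)
# and sum to 5/8, which is at most 1.
FAMILY = [
    (frozenset({"00"}), Fraction(1, 4)),
    (frozenset({"00", "01"}), Fraction(1, 4)),
    (frozenset(STRINGS), Fraction(1, 8)),
]

# Each set S spreads its weight evenly over its |S| elements, so summing
# nu over all strings recovers the total weight of the family.
total = sum(nu(x, FAMILY) for x in STRINGS)
```

Here `total` equals 5/8, the sum of the three weights, so the toy function is a semimeasure on the four strings. Restricting the family further (as the bound $K(S \mid n) \le i$ does) can only lower the total.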



Lemma IV.4: There are constants $c_1, c_2$ such that for
some $x$ of length $n$,

$$\nu^{\le k - c_1}(x \mid n) \le 2^{-n}, \qquad \text{(IV.5)}$$

$$k - c_2 \le K(x \mid n) \le k. \qquad \text{(IV.6)}$$

Proof: Let us fix $0 < c_1 < k$ somehow, to be chosen
appropriately later. Inequality (IV.4) implies that there is
an $x$ with (IV.5). Let $x$ be the first string of length $n$ with
this property. To prove the right inequality of (IV.6), let $p$
be the program of length $\le i = k - c_1$ that terminates last in
the standard running of all these programs simultaneously
in dovetailed fashion, on input $n$. We can use $p$ and its
length $l(p)$ to compute all programs of length $\le l(p)$ that
output finite sets using $n$. This way we obtain a list of
all sets $S$ with $K(S \mid n) \le i$. Using this list, for each
$y$ of length $n$ we can compute $\nu^{\le i}(y \mid n)$, by using the
definition (IV.3) explicitly. Since $x$ is defined as the first
$y$ with $\nu^{\le i}(y \mid n) \le 2^{-n}$, we can thus find $x$ by using $p$
and some program of constant length. If $c_1$ is chosen large
enough, then this implies $K(x \mid n) \le k$.

On the other hand, from the definition (IV.3) we have

$$\nu^{\le K(\{x\} \mid n)}(x \mid n) \ge 2^{-K(\{x\} \mid n)}.$$

This implies, by the definition of $x$, that either
$K(\{x\} \mid n) > k - c_1$ or $K(\{x\} \mid n) \ge n$. Since
$K(x \mid n) \stackrel{+}{=} K(\{x\} \mid n)$, we get the left inequality of (IV.6) in
both cases for an appropriate $c_2$. ■
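The existence claim at the start of this proof is a pigeonhole argument: $2^n$ strings share a total mass of at most 1, so some string receives at most $2^{-n}$. The sketch below searches for the first such string in lexicographic order. It assumes a hypothetical weighted family in place of the (non-computable) list of all sets with $K(S \mid n) \le i$; the function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import product


def first_light_string(n, weighted_sets):
    """Return the first length-n bit string whose mass is at most 2^-n.

    The mass of x is the sum of weight / |S| over the listed sets S that
    contain x. Returns None if no string qualifies, which cannot happen
    when the weights sum to at most 1.
    """
    threshold = Fraction(1, 2 ** n)
    for bits in product("01", repeat=n):  # lexicographic order
        x = "".join(bits)
        mass = sum(w / len(S) for S, w in weighted_sets if x in S)
        if mass <= threshold:
            return x
    return None


# Hypothetical family of sets over strings of length 2, total weight 5/8.
TOY_FAMILY = [
    (frozenset({"00"}), Fraction(1, 4)),
    (frozenset({"00", "01"}), Fraction(1, 4)),
    (frozenset({"00", "01", "10", "11"}), Fraction(1, 8)),
]
```

For this family `first_light_string(2, TOY_FAMILY)` returns `"01"`: the string `"00"` has mass 13/32, above the threshold 1/4, while `"01"` has mass 5/32. In the proof the corresponding search is what makes $x$ describable from $p$ by a constant-length program.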
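Consider now a new semicomputable function

$$\mu_{x,i}(S \mid n) = \frac{2^n \mathbf{m}(S \mid n)}{|S|}$$

on all finite sets $S \ni x$ with $K(S \mid n) \le i$. Then we have,
with $i = k - c_1$,

$$\sum_S \mu_{x,i}(S \mid n) = 2^n \sum_{S \ni x,\ K(S \mid n) \le i} \frac{\mathbf{m}(S \mid n)}{|S|} = 2^n \nu^{\le i}(x \mid n) \le 1$$

by (IV.3), (IV.5), respectively, and so $\mu_{x,i}(S \mid n)$ with $x, i, n$
fixed is a lower semicomputable semimeasure. By the
dominating property we have $\mathbf{m}(S \mid x, i, n) \stackrel{*}{>} \mu_{x,i}(S \mid n)$. Since
$n$ is the length of $x$ and $i \stackrel{+}{=} k$ we can set
$K(S \mid x, i, n) \stackrel{+}{=} K(S \mid x, k)$, and hence
$K(S \mid x, k) \stackrel{+}{<} -\log \mu_{x,i}(S \mid n)$.

Then, with the first $\stackrel{+}{=}$ because of (IV.6),

$$K(S \mid x, K(x \mid n)) \stackrel{+}{=} K(S \mid x, k) \stackrel{+}{<} -\log \mu_{x,i}(S \mid n) = \log |S| - n + K(S \mid n). \qquad \text{(IV.7)}$$

Then, by the additivity property (II.1) and (IV.7):

$$K(x \mid S, K(S \mid n), n) \stackrel{+}{=} K(x \mid n) + K(S \mid x, K(x \mid n)) - K(S \mid n) \stackrel{+}{<} k + \log |S| - n.$$

Hence $\delta^*(x \mid S, n) = \log |S| - K(x \mid S, K(S \mid n), n) \stackrel{+}{>} n - k$. ■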



We are now in the position to prove Lemma III.18:
For every length $n$, there exist strings $x$ of length $n$ with
$K(x \mid n) \stackrel{+}{=} n$ for which $\{x\}$ is an explicit minimal sufficient
statistic.

Proof (of Lemma III.18): Let $x$ be one of the non-stochastic
objects of which the existence is established by
Theorem IV.2. Choose $x$ with $K(x \mid n) \stackrel{+}{=} k$ so that the
set $S = \{x\}$ has complexity $K(S \mid n) = k - c_1$ and $x$ has
randomness deficiency $\stackrel{+}{=} 0$ with respect to $S$. Because $x$ is
non-stochastic, this yields $0 \stackrel{+}{=} \beta^*_x(k - c_1 \mid n) \stackrel{+}{>} n - K(x \mid n)$.
For every $x$ we have $K(x \mid n) \stackrel{+}{<} n$. Together it follows
that $K(x \mid n) \stackrel{+}{=} n$. That is, these non-stochastic objects
$x$ have complexity $K(x \mid n) \stackrel{+}{=} n$. Nonetheless, there is a
constant $c'$ such that $x$ is not random, typical, or in general
position with respect to any explicitly represented finite set
$S$ containing it that has complexity $K(S \mid n) < n - c'$, but
they are random, typical, or in general position for some
sets $S$ with complexity $K(S \mid n) \stackrel{+}{>} n$ like $S = \{x\}$. That
is, every explicit sufficient statistic $S$ for $x$ has complexity
$K(S \mid n) \stackrel{+}{=} n$, and $\{x\}$ is such a statistic. Hence $\{x\}$ is an
explicit minimal sufficient statistic for $x$. ■

V. Probabilistic Models

It remains to generalize the model class from finite sets
to the more natural and significant setting of probability
distributions. Instead of finite sets the models are
computable probability density functions $P : \{0,1\}^* \to [0,1]$
with $\sum_x P(x) \le 1$; we allow defective probability
distributions where we may concentrate the surplus probability
on a distinguished "undefined" element. "Computable"
means that there is a Turing machine $T_P$ that computes
approximations to the value of $P$ for every argument (a more
precise definition follows below). The (prefix-) complexity
$K(P)$ of a computable partial function $P$ is defined by

$$K(P) = \min_i \{ K(i) : \text{Turing machine } T_i \text{ computes } P \}.$$

The typicality equality for finite sets (Section III) now becomes

$$K(x \mid P^*) \stackrel{+}{=} -\log P(x), \qquad \text{(V.1)}$$

and equality (III.4) becomes

$$K(x) \stackrel{+}{=} K(P) - \log P(x).$$

As in the finite set case, the complexities involved are cru- 
cially dependent on what we mean by "computation" of 
P(x), that is, on the requirements on the format in which 
the output is to be represented. Recall from Q that Tur- 
ing machines can compute rational numbers: If a Turing 
machine T computes T(x), then we interpret the output 
as a pair of natural numbers, T(x) = (p, q), according to a 
standard pairing function. Then, the rational value com- 
puted by T is by definition p/q. The distinction between 
explicit and implicit description of P corresponding to the 
finite set model case is now defined as follows: 
• It is implicit if there is a Turing machine $T$ computing
$P$ halting with rational value $T(x)$ so that
$-\log T(x) \stackrel{+}{=} -\log P(x)$, and, furthermore,
$K(-\log T(x) \mid P^*) \stackrel{+}{=} 0$ for $x$ satisfying (V.1), that is, for
typical $x$.
• It is explicit if the Turing machine $T$ computing $P$,
given $x$ and a tolerance $\epsilon$, halts with rational value so
that $-\log T(x) \stackrel{+}{=} -\log(P(x) \pm \epsilon)$, and, furthermore,
$K(-\log T(x) \mid P^*) \stackrel{+}{=} 0$ for $x$ satisfying (V.1), that is,
for typical $x$.

The implicit and explicit descriptions of finite sets and of
uniform distributions with $P(x) = 1/|S|$ for all $x \in S$ and
$P(x) = 0$ otherwise, are as follows: An implicit (explicit)
description of $P$ is identical with an implicit (explicit)
description of $S$, up to a short fixed program which indicates
which of the two is intended, so that $K(P(x)) \stackrel{+}{=} K(S)$ for
$P(x) > 0$ (equivalently, $x \in S$).

To complete our discussion: the worst case of
representation format, a recursively enumerable approximation of
$P(x)$ where nothing is known about its value, would lead to
indices $-\log P(x)$ of unknown length. We do not consider
this case.

The properties for the probabilistic models are loosely
related to the properties of finite set models by
Proposition I.2. We sharpen the relations by appropriately
modifying the treatment of the finite set case, but essentially
following the same course.

We may use the notation

$$P_{\mathrm{impl}},\ P_{\mathrm{expl}}$$

for some implicit and some explicit representation of $P$.
When a result applies to both implicit and explicit
representations, or when it is clear from the context which
representation is meant, we will omit the subscript.

A. Optimal Model and Sufficient Statistic 

As before, we distinguish between "models" that are 
computable probability distributions, and the "shortest 
programs" to compute those models that are finite strings. 

Consider a string $x$ of length $n$ and prefix complexity
$K(x) = k$. We identify the structure or regularity in $x$
that is to be summarized with a computable probability
density function $P$ with respect to which $x$ is a random or
typical member. For $x$ typical for $P$ the following holds
[10]: Given an (implicitly or explicitly described) shortest
program $P^*$ for $P$, a shortest binary program computing
$x$ (that is, of length $K(x \mid P^*)$) cannot be significantly
shorter than its Shannon-Fano code of length
$-\log P(x)$, that is, $K(x \mid P^*) \stackrel{+}{>} -\log P(x)$. By definition,
we fix some agreed upon constant $\beta \ge 0$, and require

$$K(x \mid P^*) \ge -\log P(x) - \beta.$$

As before, we will not indicate the dependence on $\beta$
explicitly, but the constants in all our inequalities ($\stackrel{+}{<}$) will be
allowed to be functions of this $\beta$. This definition requires
a positive $P(x)$. In fact, since $K(x \mid P^*) \stackrel{+}{<} K(x)$, it limits
the size of $P(x)$ to $\Omega(2^{-k})$. The shortest program $P^*$ from
which a probability density function $P$ can be computed is
an algorithmic statistic for $x$ iff

$$K(x \mid P^*) \stackrel{+}{=} -\log P(x). \qquad \text{(V.2)}$$



There are two natural measures of suitability of such a
statistic. We might prefer either the simplest distribution,
or the largest distribution, as corresponding to the most
likely structure 'explaining' $x$. The singleton probability
distribution $P(x) = 1$, while certainly a statistic for $x$,
would indeed be considered a poor explanation. Both
measures relate to the optimality of a two-stage description of
$x$ using $P$:

$$K(x) \le K(x, P) \stackrel{+}{=} K(P) + K(x \mid P^*) \stackrel{+}{<} K(P) - \log P(x), \qquad \text{(V.3)}$$

where we rewrite $K(x, P)$ by (II.1). Here, $P$ can be
understood as either $P_{\mathrm{impl}}$ or $P_{\mathrm{expl}}$. Call a distribution $P$ (with
positive probability $P(x)$) for which

$$K(x) \stackrel{+}{=} K(P) - \log P(x), \qquad \text{(V.4)}$$

optimal. (More precisely, we should require
$K(x) \ge K(P) - \log P(x) - \beta$.) Depending on whether $K(P)$ is
understood as $K(P_{\mathrm{impl}})$ or $K(P_{\mathrm{expl}})$, our definition splits
into implicit and explicit optimality. The shortest program
for an optimal computable probability distribution is an
algorithmic sufficient statistic for $x$.
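The right-hand side of (V.3) is the length of a two-stage description: a code for the model followed by a Shannon-Fano code word of about $-\log P(x)$ bits for the data given the model. The sketch below only does this bookkeeping. It assumes that a model cost `model_bits` is supplied by the caller as a hypothetical stand-in for $K(P)$, which is not computable; the function names are illustrative and not from the paper.

```python
import math


def shannon_fano_length(p: float) -> int:
    """Code length ceil(-log2 p) in bits for an outcome of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be a probability in (0, 1]")
    return math.ceil(-math.log2(p))


def two_part_length(model_bits: int, p: float) -> int:
    """Total length of a two-stage description: model code plus data code.

    model_bits is a caller-supplied stand-in for K(P) (an assumption of
    this sketch); p is the probability P(x) the model assigns to the data.
    """
    return model_bits + shannon_fano_length(p)
```

For instance, a uniform model on $2^{10}$ strings with an assumed 12-bit description gives `two_part_length(12, 1 / 1024) == 22`, while a singleton model with $P(x) = 1$ and an assumed 30-bit description gives 30: all of the cost then sits in the model, which is why the singleton distribution is a poor explanation even though it is a statistic.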

B. Properties of Sufficient Statistic 

As in the case of finite set models, we start with a
sequence of lemmas that are used to obtain the main results
on minimal sufficient statistic. Several of these lemmas
have two versions: for implicit distributions and for
explicit distributions. In these cases, $P$ will denote $P_{\mathrm{impl}}$ or
$P_{\mathrm{expl}}$ respectively.

Below it is shown that the mutual information between
every typical distribution and the data is not much less
than $K(K(x))$, the complexity of the complexity $K(x)$ of
the data $x$. For optimal distributions it is at least that,
and for algorithmic minimal statistic it is equal to that.
The log-probability of a typical distribution is determined
by the following:

Lemma V.1: Let $k = K(x)$. If a distribution $P$ is
(implicitly or explicitly) typical for $x$ then
$I(x : P) \stackrel{+}{=} k + \log P(x)$.

Proof: By definition $I(x : P) \stackrel{+}{=} K(x) - K(x \mid P^*)$
and by typicality $K(x \mid P^*) \stackrel{+}{=} -\log P(x)$. ■

The above lemma states that for (implicitly or explicitly)
typical $P$ the probability $P(x) = \Theta(2^{-(k - I(x : P))})$. The
next lemma asserts that for implicitly typical $P$ the value
$I(x : P)$ can fall below $K(k)$ by no more than an additive
logarithmic term.

Lemma V.2: Let $k = K(x)$. If a distribution $P$ is
(implicitly or explicitly) typical for $x$ then
$I(x : P) \stackrel{+}{>} K(k) - K(I(x : P))$ and
$-\log P(x) \stackrel{+}{<} k - K(k) + K(I(x : P))$.
(Here, $P$ is understood as $P_{\mathrm{impl}}$ or $P_{\mathrm{expl}}$ respectively.)

Proof: Writing $k = K(x)$, since

$$k \stackrel{+}{=} K(k, x) \stackrel{+}{=} K(k) + K(x \mid k^*) \qquad \text{(V.5)}$$

by (II.1), we have $I(x : P) \stackrel{+}{=} K(x) - K(x \mid P^*) \stackrel{+}{=} K(k) -
[K(x \mid P^*) - K(x \mid k^*)]$. Hence, it suffices to show
$K(x \mid P^*) - K(x \mid k^*) \stackrel{+}{<} K(I(x : P))$. Now, from an implicit
description $P^*$ we can find the value $\stackrel{+}{=} -\log P(x) \stackrel{+}{=} k -
I(x : P)$. To recover $k$ from $P^*$, we at most require an extra
$K(I(x : P))$ bits. That is, $K(k \mid P^*) \stackrel{+}{<} K(I(x : P))$. This
reduces what we have to show to
$K(x \mid P^*) \stackrel{+}{<} K(x \mid k^*) + K(k \mid P^*)$, which is asserted by
Theorem II.1. This shows the first statement in the lemma.
The second statement follows from the first one: rewrite
$I(x : P) \stackrel{+}{=} k - K(x \mid P^*)$ and substitute
$-\log P(x) \stackrel{+}{=} K(x \mid P^*)$. ■

If we further restrict typical distributions to optimal ones
then the possible positive probabilities assumed by
distribution $P$ are slightly restricted. First we show that
implicit optimality with respect to some data is equivalent to
typicality with respect to the data combined with effective
constructability (determination) from the data.

Lemma V.3: A distribution $P$ is (implicitly or explicitly)
optimal for $x$ iff it is typical and $K(P \mid x^*) \stackrel{+}{=} 0$.

Proof: A distribution $P$ is optimal iff (V.3) holds
with equalities. Rewriting $K(x, P) \stackrel{+}{=} K(x) + K(P \mid x^*)$,
the first inequality becomes an equality iff $K(P \mid x^*) \stackrel{+}{=} 0$,
and the second inequality becomes an equality iff
$K(x \mid P^*) \stackrel{+}{=} -\log P(x)$ (that is, $P$ is a typical distribution). ■

Lemma V.4: Let $k = K(x)$. If a distribution $P$ is
(implicitly or explicitly) optimal for $x$, then
$I(x : P) \stackrel{+}{=} K(P) \stackrel{+}{>} K(k)$, and $-\log P(x) \stackrel{+}{<} k - K(k)$.

Proof: If $P$ is optimal for $x$, then
$k = K(x) \stackrel{+}{=} K(P) + K(x \mid P^*) \stackrel{+}{=} K(P) - \log P(x)$. From $P^*$ we can
find both $K(P) \stackrel{+}{=} l(P^*)$ and $\stackrel{+}{=} -\log P(x)$, and hence $k$, that is,
$K(k) \stackrel{+}{<} K(P)$. We have $I(x : P) \stackrel{+}{=} K(P) - K(P \mid x^*) \stackrel{+}{=}
K(P)$ by (II.1), Lemma V.3, respectively. This proves the
first property. Substitution of $I(x : P) \stackrel{+}{>} K(k)$ in the
expression of Lemma V.1 proves the second property. ■

Remark V. 5: Our definitions of implicit and explicit de- 
scription format entail that, for typical x, one can compute 
= — \ogP(x) and — logP(ir), respectively, from P* alone 
without requiring x. An alternative possibility would have 
been that implicit and explicit description formats refer to 
the fact that we can compute = — logP(x) and — logP(a;), 
respectively, given both P and x. This would have added 
a —K(— logP(x) | P*) addit ive t erm in the righthand side 
of the expressions in Lemma V.2 and Lemma V.4. Clearly, 
this alternative definition is equal to the one we have cho- 
sen iff this term is always = for typical x. We now show 
that this is not the case. 

Note that for distributions that are uniform (or almost
uniform) on a finite support we have
$K(-\log P(x) \mid P^*) \stackrel{+}{=} 0$: In this borderline case the result
specializes to that of Lemma III.8 for finite set models, and the two possible
definition types for implicitness and those for explicitness
coincide.

On the other end of the spectrum, for the definition type
considered in this remark, the given lower bound on $I(x : P)$
drops in case knowledge of $P^*$ doesn't suffice to compute
$-\log P(x)$, that is, if $K(-\log P(x) \mid P^*) \gg 0$ for a
statistic $P^*$ for x. The question is whether we can exhibit
such a probability distribution that is also computable.



The answer turns out to be affirmative. By a result due
to R. Solovay and P. Gács, JhJ Exercise 3.7.1 on p. 225-226,
there is a computable function $f(x) \stackrel{+}{>} K(x)$ such that
$f(x) \stackrel{+}{=} K(x)$ for infinitely many x. Considering the case
of P optimal for x (a stronger assumption than that P is
just typical) we have $-\log P(x) \stackrel{+}{=} K(x) - K(P)$. Choosing
P(x) such that $-\log P(x) \stackrel{+}{=} \log f(x) - K(P)$, we have that
P(x) is computable since f(x) is computable and K(P) is
a fixed constant. Moreover, there are infinitely many x's
for which P is optimal, so $K(-\log P(x) \mid P^*) \to \infty$ for
$x \to \infty$ through this special sequence.

C. Concrete Minimal Sufficient Statistic 

A simplest implicitly optimal distribution (that is, of
least complexity) is an implicit algorithmic minimal sufficient
statistic. As before, let $S^k = \{y : K(y) \le k\}$. Define
the distribution $P^k(x) = 1/|S^k|$ for $x \in S^k$, and $P^k(x) = 0$
otherwise. The demonstration that $P^k(x)$ is an implicit
algorithmic minimal sufficient statistic proceeds completely
analogously to the finite set model setting, Corollary III.13,
using the substitution $K(-\log P^k(x) \mid (P^k)^*) \stackrel{+}{=} 0$.

A similar equivalent construction suffices to obtain an
explicit algorithmic minimal near-sufficient statistic for x,
analogous to the one in the finite set model setting,
Theorem III.21. That is, $P(y) = 1/|S|$ for $y \in S$, and 0
otherwise, where S is the finite set constructed in Theorem III.21.

In general, one can develop the theory of minimal sufficient
statistic for models that are probability distributions
similarly to that of finite set models.
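
The two-part code behind a uniform model can be illustrated with a computable toy. The sketch below is not the construction above, because K is uncomputable. It assumes a hypothetical finite set S of all binary strings of length n with the same number of ones as x, with the uniform distribution on S as the model. The model cost is a crude stand-in for K(P) (the bits needed to write n and k), and the data-to-model cost is $-\log P(x) = \log |S|$. The function name `two_part_code_length` is illustrative only.

```python
import math

def two_part_code_length(x: str) -> dict:
    """Two-part description length (bits) of a binary string x under the
    uniform distribution on S = {y : len(y) == n and y has k ones}.

    Assumption: model_bits is a computable stand-in for K(P), namely the
    bits needed to write down n and k. It is not K(P) itself.
    """
    n = len(x)
    k = x.count("1")
    set_size = math.comb(n, k)                      # |S|
    model_bits = 2 * math.ceil(math.log2(n + 1))    # write n and k (k <= n)
    data_bits = math.log2(set_size)                 # -log2 P(x) = log2 |S|
    return {
        "n": n,
        "k": k,
        "model_bits": model_bits,
        "data_bits": data_bits,
        "total_bits": model_bits + data_bits,
    }

if __name__ == "__main__":
    for label, x in [("sparse", "0" * 90 + "1" * 10), ("balanced", "01" * 50)]:
        r = two_part_code_length(x)
        print(label, r["model_bits"], round(r["data_bits"], 2),
              round(r["total_bits"], 2), "vs literal", r["n"])
```

For the sparse string the two-part code is shorter than the literal 100 bits; for the balanced string this particular model class saves nothing, even though the string is highly regular, so a different model would be needed to capture its structure.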

D. Non-Quasistochastic Objects 

As in the more restricted case of finite sets, there are
objects that are not typical for any explicitly computable
probability distribution that has complexity significantly
below that of the object itself. With the terminology of
(I.5), we may call such objects absolutely non-quasistochastic.

By Proposition I.2, item (b), there are constants c and
C such that if x is not $(\alpha + c \log n, \beta + C)$-stochastic (I.4)
then x is not $(\alpha, \beta)$-quasistochastic (I.5). Substitution in
Theorem IV.2 yields:



Corollary V.6: There are constants c, C such that, for
every $k < n$, there are constants $c_1, c_2$ and a binary string
x of length n with $K(x \mid n) \le k$ such that x is not
$(k - c \log n - c_1, n - k - C - c_2)$-quasistochastic.

As a particular consequence: Let x with length n be one
of the non-quasistochastic strings of which the existence
is established by Corollary V.6. Substituting
$K(x \mid n) \stackrel{+}{<} k - c \log n$, we can contemplate the distribution
$P_x(y) = 1$ for $y = x$ and 0 otherwise. Then we have complexity
$K(P_x \mid n) \stackrel{+}{=} K(x \mid n)$. Clearly, x has randomness
deficiency $\stackrel{+}{=} 0$ with respect to $P_x$. Because of the assumption
of non-quasistochasticity of x, and because the minimal
randomness-deficiency $\stackrel{+}{=} n - k$ of x is always nonnegative,
$0 \stackrel{+}{=} n - k \stackrel{+}{>} n - K(x \mid n) - c \log n$. Since it generally
holds that $K(x \mid n) \stackrel{+}{<} n$, it follows that
$n \stackrel{+}{>} K(x \mid n) \stackrel{+}{>} n - c \log n$. That is, these
non-quasistochastic objects have
complexity $K(x \mid n) \stackrel{+}{=} n - O(\log n)$ and are not random,
typical, or in general position with respect to any explicitly
computable distribution P with $P(x) > 0$ and complexity
$K(P \mid n) \stackrel{+}{<} n - (c + 1) \log n$, but they are random, typical,
or in general position only for some distributions P
with complexity $K(P \mid n) \stackrel{+}{>} n - c \log n$ like $P_x$. That
is, every explicit sufficient statistic P for x has complexity
$K(P \mid n) \stackrel{+}{>} n - c \log n$, and $P_x$ is such a statistic.

VI. Algorithmic Versus Probabilistic 

Algorithmic sufficient statistic, a function of the data, 
is so named because intuitively it expresses an individual 
summarizing of the relevant information in the individ- 
ual data, reminiscent of the probabilistic sufficient statistic 
that summarizes the relevant information in a data random 
variable about a model random variable. Formally, how- 
ever, previous authors have not established any relation. 
Other algorithmic notions have been successfully related 
to their probabilistic counterparts. The most significant 
one is that for every computable probability distribution, 
the expected prefix complexity of the objects equals the en- 
tropy of the distribution up to an additive constant term, 
related to the complexity of the distribution in question. 



We have used this property in (II.4) to establish a similar 
relation between the expected algorithmic mutual informa- 
tion and the probabilistic mutual information. We use this 
in turn to show that there is a close relation between the 
algorithmic version and the probabilistic version of suffi- 
cient statistic: A probabilistic sufficient statistic is with 
high probability a natural conditional form of algorithmic 
sufficient statistic for individual data, and, conversely, that 
with high probability a natural conditional form of algo- 
rithmic sufficient statistic is also a probabilistic sufficient 
statistic. 

Recall the terminology of probabilistic mutual information
(I.1) and probabilistic sufficient statistic (I.2). Consider
a probabilistic ensemble of models, a family of computable
probability mass functions $\{f_\theta\}$ indexed by a discrete
parameter $\theta$, together with a computable distribution
$p_1$ over $\theta$. (The finite set model case is the restriction where
the $f_\theta$'s are restricted to uniform distributions with finite
supports.) This way we have a random variable $\Theta$ with
outcomes in $\{f_\theta\}$ and a random variable X with outcomes
in the union of domains of $f_\theta$, and $p(\theta, x) = p_1(\theta) f_\theta(x)$ is
computable.
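
As a small numerical illustration of the probabilistic side of this setup, the following sketch computes the probabilistic mutual information for a toy ensemble. Everything concrete in it is an assumption made for illustration: three Bernoulli parameters with a uniform prior $p_1$, data strings of length 3, and the statistics `count_ones` (sufficient for this family) and `first_bit` (not sufficient).

```python
import itertools
import math
from collections import defaultdict

THETAS = [0.2, 0.5, 0.8]             # hypothetical discrete parameter values
PRIOR = {t: 1 / 3 for t in THETAS}   # hypothetical prior p_1(theta)
N = 3                                # length of the data strings

def f(theta, x):
    """f_theta(x): probability of bit tuple x under i.i.d. Bernoulli(theta)."""
    ones = sum(x)
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def mutual_information(stat):
    """I(Theta; stat(X)) in bits, for p(theta, x) = p_1(theta) * f_theta(x)."""
    joint = defaultdict(float)       # p(theta, s) with s = stat(x)
    for theta in THETAS:
        for x in itertools.product((0, 1), repeat=N):
            joint[(theta, stat(x))] += PRIOR[theta] * f(theta, x)
    marginal = defaultdict(float)    # p(s)
    for (theta, s), p in joint.items():
        marginal[s] += p
    return sum(
        p * math.log2(p / (PRIOR[theta] * marginal[s]))
        for (theta, s), p in joint.items()
        if p > 0
    )

def identity(x):
    return x            # the data itself

def count_ones(x):
    return sum(x)       # sufficient for the Bernoulli family

def first_bit(x):
    return x[0]         # discards information about theta

if __name__ == "__main__":
    for stat in (identity, count_ones, first_bit):
        print(stat.__name__, mutual_information(stat))
```

In this toy case the number of ones preserves all of the mutual information between parameter and data, while the first bit alone preserves strictly less.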

Notation VI.1: To compare the algorithmic sufficient
statistic with the probabilistic sufficient statistic it is
convenient to denote the sufficient statistic as a function $S(\cdot)$
of the data in both cases. Let a statistic S(x) of data x
be the more general form of probability distribution as in
Section V. That is, S maps the data x to the parameter
$\rho$ that determines a probability mass function $f_\rho$ (possibly
not an element of $\{f_\theta\}$). Note that "$f_\rho(\cdot)$" corresponds to
"$P(\cdot)$" in Section V. If $f_\rho$ is computable, then this can be
the Turing machine $T_\rho$ that computes $f_\rho$. Hence, in the
current section, "S(x)" denotes a probability distribution,
say $f_\rho$, and "$f_\rho(x)$" is the probability $f_\rho$ concentrates on
data x.

Remark VI.2: In the probabilistic statistics setting, every
function T(x) is a statistic of x, but only some of them
are a sufficient statistic. In the algorithmic statistic setting
we have a quite similar situation. In the finite set statistic
case S(x) is a finite set, and in the computable probability
mass function case S(x) is a computable probability
mass function. In both algorithmic cases we have shown
$K(S(x) \mid x^*) \stackrel{+}{=} 0$ when S(x) is an implicitly or explicitly
described sufficient statistic. This means that the number
of such sufficient statistics for x is bounded by a universal
constant, and that there is a universal program to compute
all of them from $x^*$, and hence to compute the minimal
sufficient statistic from $x^*$. ◊

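Lemma VI.3: Let $p(\theta, x) = p_1(\theta) f_\theta(x)$ be a computable
joint probability mass function, and let S be a function.
Then the three conditions below are equivalent:

(i) S is a probabilistic sufficient statistic (in the form
$I(\Theta; X) \stackrel{+}{=} I(\Theta; S(X))$).

(ii) S satisfies

$$\sum_{\theta,x} p(\theta,x) I(\theta : x) \stackrel{+}{=} \sum_{\theta,x} p(\theta,x) I(\theta : S(x)). \qquad \text{(VI.1)}$$

(iii) S satisfies

$$I(\Theta; X) \stackrel{+}{=} I(\Theta; S(X)) \stackrel{+}{=} \sum_{\theta,x} p(\theta,x) I(\theta : x) \stackrel{+}{=} \sum_{\theta,x} p(\theta,x) I(\theta : S(x)).$$

All $\stackrel{+}{=}$ signs hold up to a $\pm 2K(p)$ constant additive
term.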

Proof: Clearly, (iii) implies (i) and (ii).
We show that both (i) implies (iii) and (ii) implies (iii):
By (II.4) we have

$$I(\Theta; X) \stackrel{+}{=} \sum_{\theta,x} p(\theta,x) I(\theta : x), \qquad \text{(VI.2)}$$

$$I(\Theta; S(X)) \stackrel{+}{=} \sum_{\theta,x} p(\theta,x) I(\theta : S(x)),$$

where we absorb a $\pm 2K(p)$ additive term in the $\stackrel{+}{=}$ sign.
Together with (VI.1), (VI.2) implies

$$I(\Theta; X) \stackrel{+}{=} I(\Theta; S(X)); \qquad \text{(VI.3)}$$

and vice versa (VI.3) together with (VI.2) implies (VI.1). ■



Remark VI.4: It may be worth stressing that S in Lemma
VI.3 can be any function, without restriction. ◊

Remark VI.5: Note that (VI.3) involves equality $\stackrel{+}{=}$
rather than precise equality as in the definition of the
probabilistic sufficient statistic (I.2). ◊

Definition VI.6: Assume the terminology and notation
above. A statistic S for data x is $\theta$-sufficient with deficiency
$\delta$ if $I(\theta : x) \stackrel{+}{=} I(\theta : S(x)) + \delta$. If $\delta \stackrel{+}{=} 0$ then S(x) is simply a
$\theta$-sufficient statistic.

The following lemma shows that $\theta$-sufficiency is a type
of conditional sufficiency:

Lemma VI.7: Let S(x) be a sufficient statistic for x.
Then,

$$K(x \mid \theta^*) + \delta \stackrel{+}{=} K(S(x) \mid \theta^*) - \log S(x) \qquad \text{(VI.4)}$$

iff $I(\theta : x) \stackrel{+}{=} I(\theta : S(x)) + \delta$.

Proof: (If) By assumption,
$K(S(x)) - K(S(x) \mid \theta^*) + \delta \stackrel{+}{=} K(x) - K(x \mid \theta^*)$.
Rearrange and add $-K(x \mid S(x)^*) - \log S(x) \stackrel{+}{=} 0$
(by typicality) to the right-hand side to obtain
$K(x \mid \theta^*) + K(S(x)) \stackrel{+}{=} K(S(x) \mid \theta^*) + K(x) - K(x \mid S(x)^*) - \log S(x) - \delta$.
Substitute according to $K(x) \stackrel{+}{=} K(S(x)) + K(x \mid S(x)^*)$
(by sufficiency) in the right-hand side, and subsequently
subtract $K(S(x))$ from both sides, to obtain (VI.4).

(Only If) Reverse the proof of the (If) case.

■

The following theorems state that S(X) is a probabilistic
sufficient statistic iff S(x) is an algorithmic $\theta$-sufficient
statistic, up to small deficiency, with high probability.

Theorem VI.8: Let $p(\theta, x) = p_1(\theta) f_\theta(x)$ be a computable
joint probability mass function, and let S be a
function. If S is a recursive probabilistic sufficient statistic,
then S is a $\theta$-sufficient statistic with deficiency O(k),
with p-probability at least $1 - \frac{1}{k}$.

Proof: If S is a probabilistic sufficient statistic, then,
by Lemma VI.3, equality of p-expectations (VI.1) holds.
However, it is still consistent with this to have large positive
and negative differences $I(\theta : x) - I(\theta : S(x))$ for
different $(\theta, x)$ arguments, such that these differences cancel
each other. This problem is resolved by appeal to
the algorithmic mutual information non-increase law (II.6)
which shows that all differences are essentially positive:
$I(\theta : x) - I(\theta : S(x)) \stackrel{+}{>} -K(S)$. Altogether, let $c_1, c_2$ be
least positive constants such that $I(\theta : x) - I(\theta : S(x)) + c_1$
is always nonnegative and its p-expectation is $c_2$. Then, by
Markov's inequality,

$$p\bigl(I(\theta : x) - I(\theta : S(x)) \ge k c_2 - c_1\bigr) \le \frac{1}{k},$$

that is,

$$p\bigl(I(\theta : x) - I(\theta : S(x)) < k c_2 - c_1\bigr) \ge 1 - \frac{1}{k}.$$

■
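
The Markov step can be checked on a toy example. In the sketch below, D stands for the difference $I(\theta : x) - I(\theta : S(x))$; the three-point distribution `TOY`, the shift $c_1 = 2$, and the function name `markov_tail` are hypothetical choices for illustration, not quantities from the theorem. Exact rational arithmetic is used so that the inequality is verified without rounding.

```python
from fractions import Fraction

def markov_tail(dist, c1, k):
    """dist: list of (d, prob) pairs describing the difference D.
    c1 shifts D so that Y = D + c1 is nonnegative; c2 = E[Y].
    Returns (c2, threshold, tail) with threshold = k*c2 - c1 and
    tail = P(D >= threshold), which Markov's inequality bounds by 1/k.
    """
    assert sum(p for _, p in dist) == 1, "probabilities must sum to 1"
    assert all(d + c1 >= 0 for d, _ in dist), "c1 must make D + c1 nonnegative"
    c2 = sum((d + c1) * p for d, p in dist)
    threshold = k * c2 - c1
    tail = sum((p for d, p in dist if d >= threshold), Fraction(0))
    return c2, threshold, tail

# Hypothetical differences: mostly slightly negative, occasionally large,
# with expectation zero (so the positive and negative parts cancel).
TOY = [(-2, Fraction(1, 2)), (0, Fraction(1, 4)), (4, Fraction(1, 4))]

if __name__ == "__main__":
    for k in range(1, 6):
        c2, threshold, tail = markov_tail(TOY, c1=2, k=k)
        print(k, c2, threshold, tail, tail <= Fraction(1, k))
```

The point of the shift is the same as in the proof: Markov's inequality needs a nonnegative variable, and the non-increase law supplies the constant that makes the differences nonnegative.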



Theorem VI.9: For each n, consider the set of data x of
length n. Let $p(\theta, x) = p_1(\theta) f_\theta(x)$ be a computable joint
probability mass function, and let S be a function. If S is
an algorithmic $\theta$-sufficient statistic for x, with p-probability
at least $1 - \epsilon$ ($1/\epsilon = n + 2 \log n$), then S is a probabilistic
sufficient statistic.

Proof: By assumption, using Definition VI.6, there is
a positive constant $c_1$, such that,

$$p\bigl(|I(\theta : x) - I(\theta : S(x))| \le c_1\bigr) \ge 1 - \epsilon.$$

Therefore,

$$0 \le \sum_{|I(\theta : x) - I(\theta : S(x))| \le c_1} p(\theta,x)\,|I(\theta : x) - I(\theta : S(x))| \stackrel{+}{<} (1 - \epsilon) c_1 \stackrel{+}{=} 0.$$

On the other hand, since

$$1/\epsilon \stackrel{+}{>} n + 2 \log n \stackrel{+}{>} K(x) \stackrel{+}{>} \max_{\theta,x} I(\theta : x),$$

we obtain

$$0 \le \sum_{|I(\theta : x) - I(\theta : S(x))| > c_1} p(\theta,x)\,|I(\theta : x) - I(\theta : S(x))| \stackrel{+}{<} \epsilon (n + 2 \log n) \stackrel{+}{<} 0.$$

Altogether, this implies (VI.1), and by Lemma VI.3, the
theorem. ■

VII. Conclusion 

An algorithmic sufficient statistic is an individual finite 
set (or probability distribution) for which a given individ- 
ual sequence is a typical member. The theory is formulated 
in Kolmogorov's absolute notion of the quantity of information
in an individual object. This is a notion analogous
to, and in some sense sharper than, the probabilistic notion
of sufficient statistic, which is an average notion based on the
entropies of random variables. It turned out that for every
sequence x we can determine the complexity range of possible
algorithmic sufficient statistics, and, in particular,
exhibit an algorithmic minimal sufficient statistic. The manner
in which the statistic is effectively represented is crucial: we 
distinguish implicit representation and explicit representa- 
tion. The latter is essentially a list of the elements of a 
finite set or a table of the probability density function; the 
former is less explicit than a list or table but more explicit 
than just recursive enumeration or approximation in the 
limit. The algorithmic minimal sufficient statistic can be 
considerably more complex depending on whether we want 
explicit or implicit representations. We have shown that 
there are sequences that have no simple explicit algorith- 
mic sufficient statistic: the algorithmic minimal sufficient 
statistic is essentially the sequence itself. Note that such se- 
quences cannot be random in the sense of having maximal 
Kolmogorov complexity — in that case already the simple 
set of all sequences of its length, or the corresponding uni- 
form distribution, is an algorithmic sufficient statistic of 
almost zero complexity. We demonstrated close relations 
between the probabilistic notions and the corresponding al- 
gorithmic notions: (i) The average algorithmic mutual in- 
formation is equal to the probabilistic mutual information, 
(ii) To compare algorithmic sufficient statistic and prob- 
abilistic sufficient statistic meaningfully one needs to con- 
sider a conditional version of algorithmic sufficient statistic. 
We defined such a notion and demonstrated that a probabilistic
sufficient statistic is with high probability an (appropriately
conditioned) algorithmic sufficient statistic and
vice versa. The most conspicuous theoretical open end is
as follows: For explicit descriptions we were only able to
guarantee an algorithmic minimal near-sufficient statistic,
although the construction can be shown to be minimal suf- 
ficient for almost all sequences. One would like to obtain 
a concrete example of a truly explicit algorithmic minimal 
sufficient statistic. 

A. Subsequent Work 

One can continue generalization of model classes for al- 
gorithmic statistic beyond computable probability mass 
functions. The ultimate model class is the set of recur- 
sive functions. In the manuscript |Q, provisionally entitled 
"Sophistication Revisited" , the following results have been 
obtained. For the set of partial recursive functions the minimal
sufficient statistic has complexity $\stackrel{+}{=} 0$ for all data x.
One can define equivalents of the implicit and explicit de- 
scription format in the total recursive function setting. We 
obtain various upper and lower bounds on the complexities 
of the minimal sufficient statistic in all three description 
formats. The complexity of the minimal sufficient statis- 
tic for x, in the model class of total recursive functions, 
is called its "sophistication." Hence, one can distinguish 
three different sophistications corresponding to the three 
different description formats: explicit, implicit, and unre- 
stricted. It turns out that the sophistication functions are 
not recursive; the Kolmogorov prefix complexity can be 
computed from the minimal sufficient statistic (every de- 
scription format) and vice versa; given the minimal suffi- 
cient statistic as a function of x one can solve the so-called 
"halting problem" {Tofl ; and the sophistication functions 
are upper semicomputable. By the same proofs, such com- 
putability properties also hold for the minimal sufficient 
statistics in the model classes of finite sets and computable 
probability mass functions. 

B. Application 

Because the Kolmogorov complexity is not computable, 
an algorithmic sufficient statistic cannot be computed ei- 
ther. Nonetheless, the analysis gives limits to what is 
achievable in practice — like in the cases of coding theo- 
rems and channel capacities under different noise models 
in Shannon information theory. The theoretical notion of 
algorithmic sufficient statistic forms the inspiration to de- 
velop applied models that can be viewed as computable 
approximations. Minimum description length (MDL),[||, is 
a good example; its relation with the algorithmic minimal 
sufficient statistic is given in [20]. As in the case of ordinary
probabilistic statistic, algorithmic statistic if applied unre- 
strained cannot give much insight into the meaning of the 
data; in practice one must use background information to 
determine the appropriate model class first (establishing
what meaning the data can have) and only then apply
algorithmic statistic to obtain the best model in that class
by optimizing its parameters. See Example [II. 5. Nonetheless,
in applications one can sometimes still unrestrictedly
use compression properties for model selection, for example 
by a judicious choice of model parameter to optimize. One 



example is the precision at which we represent the other 
parameters: too high precision causes accidental noise to
be modeled as well, while too low precision may cause models
that should be distinct to be confused. In general, the
performance of a model for a given data sample depends 
critically on what we may call the "degree of discretization" 
or the "granularity" of the model: the choice of precision 
of the parameters, the number of nodes in the hidden layer 
of a neural network, and so on. The granularity is often 
determined ad hoc. In two quite different experimental
settings the best model granularity values predicted by
MDL are shown to coincide with the best values found
experimentally.
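
The precision trade-off can be made concrete with a toy two-part code. The sketch below is not one of the experiments referred to above. It assumes a Bernoulli model whose single parameter is quantized to `precision_bits` bits, charges exactly that many bits for the parameter (a simplifying assumption), and adds the cost of the data given the best parameter on that grid. The function names and the sample counts are hypothetical.

```python
import math

def two_part_length(ones: int, zeros: int, precision_bits: int) -> float:
    """Two-part code length in bits: parameter cost plus data cost.

    Assumption: a parameter on a grid of 2**precision_bits cells costs
    exactly precision_bits bits; the data cost is -log2 P(data) under the
    best grid point.
    """
    cells = 2 ** precision_bits
    best_data_cost = math.inf
    for j in range(cells):
        theta = (j + 0.5) / cells            # cell midpoint, never 0 or 1
        cost = -(ones * math.log2(theta) + zeros * math.log2(1 - theta))
        best_data_cost = min(best_data_cost, cost)
    return precision_bits + best_data_cost

def best_precision(ones: int, zeros: int, max_bits: int = 12):
    """Return the precision minimizing the two-part length, and all lengths."""
    lengths = {d: two_part_length(ones, zeros, d) for d in range(max_bits + 1)}
    return min(lengths, key=lengths.get), lengths

if __name__ == "__main__":
    d, lengths = best_precision(ones=300, zeros=700)
    for bits, total in lengths.items():
        print(bits, round(total, 2))
    print("chosen precision:", d)
```

With too few bits the grid cannot approximate the observed frequency, and with too many bits the parameter cost outweighs the small gain in fit, so the minimum falls at an intermediate precision.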